(54) Apparatus and method for packet-based media communications



(57) Packet-based central conference bridges,
packet-based network interfaces and packet-based
terminals are used for voice communications over a
packet-based network. Modifications to these apparatuses
can reduce the latency and the signal processing
requirements while increasing the signal quality within a
voice conference as well as in point-to-point
communications. For instance, selecting the talkers prior
to the decompression of the voice signals can decrease the
latency and increase the signal quality within the voice
conference, because the decompression and subsequent
compression operations in a conference bridge become
unnecessary in some circumstances. Further, the removal of
the jitter buffers within the conference bridges and the
moving of the mixing operation to the individual terminals
and/or network interfaces are modifications that can lower
the latency and transcoding within the voice conference.

Description

FIELD OF THE INVENTION 

[0001] This invention relates generally to packet-based
media communications and more specifically to
media conferencing within a packet-based communica- 
tion network. 

BACKGROUND OF THE INVENTION 

[0002] Prior to the use of packet-based voice commu- 
nications, telephone conferences were a service option 
available within standard non-packet-based telephone 
networks such as Pulse Code Modulation (PCM) telephone
networks. As depicted in FIGURE 1, a standard
telephone switch 20 is coupled to a plurality of telephone 
handsets 22 to be included within a conference session 
as well as a central conference bridge 24. It is noted that 
these telephone handsets 22 are coupled to the tele- 
phone switch 20 via numerous other telephone switches 
(not shown). The telephone switch 20 forwards any 
voice communications received from the handsets 22 to 
the central conference bridge 24, which then utilizes a 
standard algorithm to control the conference session. 
[0003] One such algorithm used to control a confer- 
ence session, referred to as a "party line" approach, 
comprises the steps of mixing the voice communica- 
tions received from each telephone handset 22 within 
the conference session and further distributing the result 
to each of the telephone handsets 22 for broadcasting. 
A problem with this algorithm is the amount of noise that 
is combined during the mixing step, this noise compris- 
ing a background noise source corresponding to each 
of the telephone handsets 22 within the conference ses- 
sion. 

[0004] An improved algorithm for controlling a confer- 
ence session is disclosed within European patent appli- 
cation 97310458.1 entitled "Method of Providing Con- 
ferencing in Telephony" by Dal Farra et al, filed on De- 
cember 22, 1997, assigned to the assignee of the 
present invention. This algorithm comprises the steps 
of selecting primary and secondary talkers, mixing the 
voice communications from these two talkers and for- 
warding the result of the mixing to all the participants 
within the conference session except for the primary and 
secondary talkers; the primary and secondary talkers 
receiving the voice communications corresponding to 
the secondary and primary talkers respectively. The se- 
lection and mixing of only two talkers at any one time 
can reduce the background noise level within the con- 
ference session when compared to the "party line" ap- 
proach described above. 
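
The distribution rule of the two-talker algorithm described above can be sketched as follows. This is a minimal illustration only: the function name `distribute`, the use of 16-bit linear samples, and mixing by clipped addition are assumptions made for the example and are not taken from the referenced application.

```python
def distribute(frames, primary, secondary):
    """Return the frame each participant should hear.

    frames: dict mapping participant id -> list of linear PCM samples
            (assumed 16-bit signed integers, equal length per frame).
    primary, secondary: ids of the two selected talkers.
    """
    def clip(sample):
        # Keep the mixed sample inside the 16-bit signed range (assumption).
        return max(-32768, min(32767, sample))

    # Mix only the two selected talkers; all other inputs are left out,
    # which is what keeps their background noise out of the result.
    mixed = [clip(a + b) for a, b in zip(frames[primary], frames[secondary])]

    outputs = {}
    for participant in frames:
        if participant == primary:
            outputs[participant] = frames[secondary]   # primary hears secondary
        elif participant == secondary:
            outputs[participant] = frames[primary]     # secondary hears primary
        else:
            outputs[participant] = mixed               # everyone else hears the mix
    return outputs
```

Only three distinct output signals are produced regardless of the number of participants: the mix and the two unmixed talker signals.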

[0005] In a standard PCM telephone network as is depicted
in FIGURE 1, all of the voice communications are
in PCM format when being received at the central conference
bridge 24 and when being sent to the individual
telephone handsets 22. Hence, in this situation, the mixing
of the voice communications corresponding to the
primary and secondary talkers is relatively simple with
no conversions of format required.
[0006] Currently, packet-based voice communica- 
tions are being utilized more frequently as Voice-over- 
Internet Protocol (VoIP) becomes increasingly popular. 
In these standard VoIP voice communications, voice data
in PCM form is being encapsulated with a header and
footer to form voice data packets; the header in these 
packets having, among other things, a Real Time Pro- 
tocol (RTP) header that contains a time stamp corre- 
sponding to when the packet was generated. One area 
that requires considerable improvement is the use of 
packet-based voice communications to perform tele- 
phone conferencing capabilities. 

[0007] As depicted within FIGURE 2, a plurality of 
packet-based voice communication terminals, VoIP 
handsets 26 in this case, are coupled to a packet-based 
network, an IP network 28 in this case. Currently, in or- 
der for the users of these VoIP handsets 26 to commu- 
nicate within a voice conference, a packet-based voice 
communication central bridge, in this case a VoIP central
conference bridge 30, must be coupled to the IP network
28. This VoIP central conference bridge 30 has a
number of problems, the key problems being the latency 
inherently created within the conference bridge 30 and 
the considerable amount of signal processing power required.
It should be noted that the high signal processing power
required is partially due to the conference bridge having 
to compensate for a variety of problems that typically 
exist within current IP networks; these problems includ- 
ing possible variable delays, out-of-sequence packets, 
lost packets, and/or unbounded latency. 
[0008] FIGURE 3A is a logical block diagram of a well- 
known VoIP central conference bridge design while FIG- 
URE 3B is a logical block diagram of a well-known VoIP 
handset design. In the design of FIGURE 3A, the conference
bridge 30 comprises an inputting block 32, a
talker selection and mixing block 34, and an outputting 
block 36. Typically all three of these blocks are imple- 
mented in software. 

[0009] The inputting block 32 comprises, for each participant
within the voice conference, a protocol stack (P.S.)
38 coupled in series with a jitter buffer (J.B.) 40 and
a decompression block (DECOMP.) 42, each of the de- 
compression blocks 42 further being coupled to the talk- 
er selection and mixing block 34. The protocol stacks 
38 in this design perform numerous functions including 
receiving packets comprising compressed voice sig- 
nals, hereinafter referred to as voice data packets; strip- 
ping off the packet overhead required for transmitting 
the voice data packet through the IP network 28; and 
outputting the compressed voice signals contained within
the packets to the respective jitter buffer 40. The jitter
buffers 40 receive these compressed voice signals; ensure
that the compressed voice signals are within the
proper sequence (i.e. time ordering signals); buffer the
compressed voice signals to ensure smooth playback;
and ideally implement packet loss concealment. The
output of each of the jitter buffers 40 is a series of com- 
pressed voice signals within the proper order that are 
then fed into the respective decompression block 42. 
The decompression blocks 42 receive these com- 
pressed voice signals, convert them into standard PCM 
format and output the resulting voice signals (that are in 
Pulse Code Modulation) to the talker selection and mix- 
ing block 34. 
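
The re-ordering and buffering role of a jitter buffer such as the jitter buffers 40 can be sketched as below. This is a hedged illustration under stated assumptions: the class name `JitterBuffer`, the fixed `depth` parameter, and ordering by time stamp alone are hypothetical choices for the example, and packet loss concealment is deliberately left out.

```python
import heapq


class JitterBuffer:
    """Minimal re-ordering buffer keyed on the packet time stamp."""

    def __init__(self, depth=3):
        # Number of frames held back before release (assumed tuning value).
        self.depth = depth
        self._heap = []

    def push(self, timestamp, payload):
        # Frames may arrive out of sequence; the heap keeps the oldest on top.
        heapq.heappush(self._heap, (timestamp, payload))

    def pop_ready(self):
        """Release the oldest frames, in order, while more than `depth` are held."""
        released = []
        while len(self._heap) > self.depth:
            released.append(heapq.heappop(self._heap))
        return released
```

Because `depth` frames are always held back, every jitter buffer in the path adds a delay proportional to its depth, which is the latency cost discussed in this description.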

[0010] The talker selection and mixing block 34 preferably
performs almost identical functionality to the central
conference bridge 24 within FIGURE 1. The key to
the design of a VoIP central conference bridge 30 as 
depicted in FIGURE 3A is the inputting block 32 trans- 
forming the packet-based voice communications into 
PCM voice communications so the well-known confer- 
encing algorithms can be utilized within the block 34. As 
described previously, in one conferencing algorithm, pri- 
mary and secondary talkers are selected for transmis- 
sion to the participants in the conference session to re- 
duce the background noise level from participants who 
are not talking and to simplify the mixing algorithm re- 
quired. Hence, the resulting output from the talker se- 
lection and mixing block 34 is a voice communication 
consisting of a mix between the voice communications 
received from a primary talker and a secondary talker; 
the primary and secondary talkers being determined 
within the block 34. Further outputs from the talker se- 
lection and mixing block 34 include the unmixed voice 
communications of the primary and secondary talkers 
that are to be forwarded, as described previously, to the 
secondary and primary talkers respectively. 
[0011] The outputting block 36 comprises three com- 
pression blocks 44 and a plurality of transmitters 46. The 
compression blocks 44 receive respective ones of the 
three outputs from the talker selection and mixing block 
34, compress the received voice signals, and independ- 
ently output the results to the appropriate transmitters 
46. In this case, the mixed voice signals, after being 
compressed, are forwarded to all the transmitters 46 
with the exception of the transmitters directed to the pri- 
mary and secondary talkers. The transmitters directed 
to the primary and secondary talkers receive the appro- 
priate unmixed voice signals. Each of the transmitters 
46, after receiving a compressed voice signal, subse- 
quently encapsulates this compressed voice signal 
within the packet-based format required for transmis- 
sion on the IP network 28 and transmits a voice data 
packet comprising the compressed voice signal to the 
appropriate VoIP handset 26 within the conference ses- 
sion. 

[0012] The well-known handsets 26, as depicted in
FIGURE 3B, each comprise a protocol stack 47 coupled
in series with a jitter buffer 48 and a decompression
block 49, these blocks typically being implemented in
software. Voice data packets sent from the central conference
bridge 30 are received at the protocol stack 47,
which subsequently removes the packet overhead from
the received voice data packets, leaving only the compressed
voice signal sent from the packet-based central
conference bridge 30. The jitter buffer 48 next performs
numerous functions similar to those performed by the
jitter buffers 40 including ensuring that the compressed
voice signals are within the proper sequence, buffering
the compressed voice signals to ensure smooth playback,
and ideally implementing packet loss concealment.
Subsequently, the decompression block 49 receives
the compressed voice signals, decompresses
them into PCM format, and forwards the voice signals
to the speaker within the particular handset 26 for broadcasting
the voice signals audibly.
[0013] One key problem with the setup depicted within
FIGURES 3A and 3B is the degradation of the voice
signals as the voice signals are converted from PCM
format to compressed format and vice versa, these conversions
together being referred to generally as transcoding.
A further problem results from the considerable
latency that the processing within the VoIP central conference
bridge 30 and the processing within the individual
handsets 26 create. The combined latency of this
processing can result in a significant delay between
when the talker(s) speaks and when the other participants
in the conference session hear the speech. This
delay can be noticeable to the participants if it is beyond
the perceived real-time limits of human hearing. This
could result in participants talking while not realizing that
another participant is speaking. Yet another key problem
with the design depicted in FIGURES 3A and 3B is the
considerable amount of signal processing power that is
required to implement the conference bridge 30. As stated
previously, each of the components shown within
FIGURE 3A is normally simply a software algorithm being
run on DSP component(s). This considerable
amount of required signal processing power is expensive.

[0014] Hence, a new design within a packet-based
voice communication network is required to implement
voice conferencing functionality. In this new design, a
reduction in transcoding, latency, and/or required signal
processing power within the central conference bridge
is needed.

SUMMARY OF THE INVENTION

[0015] The present invention is directed to packet-based
central conference bridges and other packet-based
components, such as packet-based network interfaces
and packet-based terminals, that could be used
for media communications over a packet-based network,
these media communications preferably being
voice communications. The apparatus of the present invention
can preferably allow for voice conferences as
well as point-to-point communications to be established
within the packet-based network with a reduction in
transcoding, latency and/or signal processing requirement.

[0016] Some embodiments of the present invention
decrease the latency within a voice conference by se- 
lecting the talkers prior to the decompression of the 
voice signals, hence making the decompression and 
subsequent compression operations in a conference 
bridge unnecessary in some circumstances. Further, 
the removal of the jitter buffers within the conference 
bridges and the moving of the mixing operation to the 
individual packet-based components are both included 
within embodiments of the present invention. These 
modifications preferably make for increased perform- 
ance within the system by decreasing transcoding and 
latency within a conference session and result in de- 
creased costs by reducing the required signal process- 
ing power for the system. Yet further, the modifications 
within the conference bridge allow for increased func- 
tionality such as an interlocking configuration of confer- 
ence bridges and three way calling without the use of a 
conference bridge at all. 

[0017] The present invention, according to a first 
broad aspect, is a conference bridge, including a receiver
and an energy detection and talker selection unit. The
receiver is capable of being coupled to a network and 
operates to receive at least one media data packet from 
at least two sources forming a media conference, each 
media data packet defining a compressed media signal. 
The energy detection and talker selection unit is coupled 
to the receiver and operates to determine at least one 
speech parameter corresponding to each of the com- 
pressed media signals and select a set of the sources 
within the media conference as talkers based on the de- 
termined speech parameters. 

[0018] According to a second broad aspect, the 
present invention is a conference bridge that includes a 
receiver, an energy detection and talker selection unit 
and an output unit. The receiver is capable of being cou- 
pled to a network and operates to receive at least one 
media data packet from at least two sources forming a 
media conference, each media data packet defining a 
compressed media signal. The energy detection and 
talker selection unit is coupled to said receiver and op- 
erates to process the received compressed media sig- 
nals including selecting a set of the sources within the 
media conference as talkers, one of the talkers being a 
lead talker. And, the output unit is coupled to the energy 
detection and talker selection unit and operates to out- 
put media data packets that correspond to compressed 
media signals received from the talkers. In this aspect, 
the media data packets corresponding to the lead talker 
are always output from the conference bridge in the 
same order as the media data packets which are re- 
ceived from the lead talker. 

[0019] Other aspects and features of the present in- 
vention will become apparent to those ordinarily skilled 
in the art upon review of the following description of spe- 
cific embodiments of the invention in conjunction with 
the accompanying figures. 



BRIEF DESCRIPTION OF THE DRAWINGS 

[0020] The preferred embodiment of the present invention
is described with reference to the following figures,
in which:

FIGURE 1 is a simplified block diagram illustrating 
a well-known non-packet-based telephone network 
with a voice conferencing capability; 
FIGURE 2 is a simplified block diagram illustrating 
a well-known packet-based network with a voice 
conferencing capability; 

FIGURES 3A and 3B are logical block diagrams il- 
lustrating a well-known packet-based central con- 
ference bridge and a well-known packet-based 
handset respectively implemented within the pack- 
et-based network of FIGURE 2; 
FIGURE 4 is a simplified block diagram illustrating 
a packet-based central conference bridge accord- 
ing to first and second preferred embodiments of 
the present invention; 

FIGURE 5 is a flow chart illustrating the operations 
preferably performed by a packet receipt block and 
an energy detection and talker selection block im- 
plemented within the packet-based central confer- 
ence bridge of FIGURE 4; 

FIGURE 6 is a flow chart illustrating the operations 
performed, according to the first preferred embodi- 
ment, by an output generator implemented within 
the packet-based central conference bridge of FIG- 
URE 4; 

FIGURE 7 is a logical block diagram illustrating the
packet-based central conference bridge of FIGURE
4 during a sample operation;
FIGURE 8 is a flow chart illustrating the operations 
performed, according to the second preferred em- 
bodiment, by an output generator implemented 
within the packet-based central conference bridge 
of FIGURE 4; 

FIGURE 9 is a logical block diagram illustrating the
packet-based central conference bridge of FIGURE
4 during a sample operation;
FIGURE 10 is a simplified block diagram illustrating
a packet-based handset according to the second
preferred embodiment of the present invention;
FIGURE 11 is a logical block diagram illustrating
the packet-based handset of FIGURE 10 during a
sample operation;

FIGURES 12A, 12B and 12C are block diagrams 
illustrating sample operations of a network compris- 
ing a series of interlocked packet-based central 
conference bridges according to an embodiment of 
the present invention; and 

FIGURE 13 is a simplified block diagram illustrating
a well-known packet-based network coupled to a
well-known PCM telephone network with a voice
conferencing capability.

DETAILED DESCRIPTION OF THE PREFERRED
EMBODIMENT 

[0021] The present invention is directed to a number 
of different methods and apparatuses that can be uti- 
lized within a packet-based voice communication sys- 
tem. Primarily, the embodiments of the present inven- 
tion are directed to methods and apparatus used for 
voice conferences within packet-based communication 
networks, but this is not meant to limit the scope of the 
present invention. 

[0022] One skilled in the art would understand that 
there are two essential sectors for the operations of a 
telephone session. These sectors include a control 
plane that performs administrative functions such as ac- 
cess approval and build-up/tear-down of telephone ses- 
sions and/or conference sessions and a media plane 
which performs the signal processing required on media 
(voice or video) streams such as format conversions 
and mixing operations. As described below, the present 
invention is applicable to modifications within the media 
plane which could be implemented with a variety of dif- 
ferent control planes while remaining within the scope 
of the present invention. 

[0023] One significant aspect of the present invention 
described herein below is directed to a packet-based 
central conference bridge coupled to a packet-based 
network for enabling voice conferences between numerous
sources of media signals. These sources of media
signals can be any terminal at which a person can output
media data for transmission to the conference bridge
and can input media data from the conference bridge.
In preferred embodiments, these sources of media signals
are packet-based terminals coupled to a packet-based
network, such as is illustrated for the VoIP handsets
26 coupled to the IP network 28 within FIGURE 2. In
other embodiments, one or more of the sources of me- 
dia signals are other terminals such as standard non- 
packet-based telephone terminals, such as PCM or an- 
alog telephone terminals, that are coupled to a packet- 
based network via a packet-based network interface. 
This situation is illustrated in FIGURE 13, in which a
non-packet-based telephone network, in this case PCM 
telephone network 150, is coupled to a packet-based 
network, in this case IP network 28, via a packet-based 
network interface, in this case IP Gateway 152. As 
shown in FIGURE 13, a number of standard PCM telephone
handsets 154 are coupled to the PCM telephone
network 150, these PCM telephone handsets 154 pos- 
sibly being considered as sources of media signals with- 
in the preferred embodiments of the present invention. 
Further, sources of media signals could be other devices
that allow for the inputting and outputting of media data,
this media data being in the form of media data packets
when it is received at/sent from the packet-based central
conference bridge described for preferred embodiments 
of the present invention. 

[0024] In the following description, it should be understood
that despite referring to the sources of media signals
as packet-based terminals throughout this document,
such references could alternatively be directed to
another form of media signal source. Further, the following
description of the preferred embodiments of the
present invention is specific to voice data packets that
contain compressed voice signals, though this should
not limit the scope of the present invention as is described
in further detail herein below.
[0025] FIGURE 4 is a simplified block diagram,
according to first and second preferred embodiments
of the present invention, that illustrates a packet-based
central conference bridge that could be coupled
to a packet-based network for enabling voice conferences
between numerous sources of media signals, which will
be described below as packet-based terminals. This
conference bridge preferably replaces, within FIGURE
2, the conference bridge depicted within FIGURE 3A.
There are a number of differences between the conference
bridge depicted in FIGURE 4 and that of FIGURE
3A as will be described herein below. These differences,
in some circumstances, decrease the transcoding and
latency inherent within the traditional packet-based
central conference bridge and reduce the required signal
processing power.

[0026] As depicted in FIGURE 4, the conference
bridge of the first and second preferred embodiments
comprises a packet receipt block 50, an energy detection
and talker selection block 60, and an output generator
70. Although the blocks within FIGURE 4 are depicted
as separate components, these blocks are meant
to be logical representations of algorithms which are
hereinafter referred to collectively as conferencing control
logic. Preferably, some or all of the conferencing
control logic is essentially software algorithms operating
within a single control component such as a DSP. In alternative
embodiments, some or all of the conferencing
control logic is comprised of hard logic and/or discrete
components.

[0027] The operations of the packet receipt block 50
and the energy detection and talker selection block 60,
according to both the first and second preferred embodiments,
will be described with reference to FIGURE 5.
The key difference between the first and second embodiments
of the present invention, as will be described
herein below, lies in the operations performed within the
output generator 70. The operation of the output generator
70, according to the first preferred embodiment, will be
described with reference to FIGURE 6 while the operation
of the output generator 70, according to the second
preferred embodiment, will be described with reference
to FIGURE 8. It is noted that when using the packet-based
central conference bridge of the first preferred
embodiment, the participants within a voice conference
preferably can utilize well-known packet-based terminals
such as the handset depicted in FIGURE 3B. On
the other hand, when using the conference bridge according
to the second preferred embodiment, the packet-based
terminals utilized by the participants of a voice
conference preferably must be modified, as will be described
herein below with reference to FIGURES 10 and
11, compared to well-known packet-based handsets. In
the case that the user is using a non-packet-based terminal
via a packet-based network interface, it is noted
that a similar situation arises. In the first preferred embodiment,
a well-known packet-based network interface
can be utilized which is similar to that depicted in FIGURE
3B but with the decompressed signals being sent
on the non-packet-based telephone network (such as a
PCM telephone network) to the appropriate non-packet-based
terminals (such as PCM terminals) rather than to
a speaker. In the case of the second preferred embodiment,
the packet-based network interface used will
have to be modified as will be described below with reference
to FIGURES 10 and 11.

[0028] FIGURE 5 is a flow chart that depicts the steps 
performed by the packet receipt block 50 and the energy 
detection and talker selection block 60 according to both 
the first and second preferred embodiments of the 
present invention. This flow chart depicts the processing 
that occurs for a single voice data packet received by 
the packet-based central conference bridge. It should 
be understood that multiple packets could proceed 
through this procedure at any one time which could pos- 
sibly result in more than one packet being processed at 
the same step at the same time. Since these steps are 
preferably software operations, the situation in which a 
multiple number of packets operate at a common step 
within the procedure simply indicates that the software 
is being used by different packets in parallel. 
[0029] The first step 80, as depicted in FIGURE 5, has 
the packet receipt block 50 receive a voice data packet 
from the packet-based network coupled to the confer- 
ence bridge. This packet may be an IP packet or a pack- 
et of another format that can be transported on the pack- 
et-based network. The packet is sent from a packet- 
based terminal being used within a voice conference 
(more generally referred to as a source for media sig- 
nals) and contains a compressed voice signal that cor- 
responds to a participant that is speaking at the partic- 
ular terminal. 

[0030] Next, as seen at step 81, the packet receipt 
block 50 removes the packet overhead from the re- 
ceived voice data packet. This overhead may include 
the actual packet header and footer utilized, as well as 
any other transport protocol wrapper. The removal of the 
packet overhead results in only the compressed voice 
signal within the received packet being forwarded on for 
further processing. It is noted though that information
contained within the packet overhead, such as the
source address, is still preferably used by the control
plane to identify the source terminal and the voice conference
to which this particular voice signal corresponds.
Further, it is noted that a time stamp within an RTP header
of the packet header is preferably extracted and used
in later processing within the media plane as described
below.
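
The overhead removal and time stamp extraction of step 81 can be sketched for the RTP portion of the overhead as follows. This is an illustrative sketch, not the packet receipt block 50 itself: the function name `strip_rtp_header` is hypothetical, the lower-layer (for example IP and UDP) headers are assumed to have been removed already, and RTP header extensions and padding are not handled.

```python
import struct

RTP_FIXED_HEADER_BYTES = 12  # fixed part of the RTP header


def strip_rtp_header(rtp_packet: bytes):
    """Return (time_stamp, compressed_signal) from one RTP packet."""
    if len(rtp_packet) < RTP_FIXED_HEADER_BYTES:
        raise ValueError("packet shorter than the fixed RTP header")

    # Network byte order: flags, marker/payload type, sequence number,
    # time stamp, synchronization source identifier.
    flags, _m_pt, _seq, time_stamp, _ssrc = struct.unpack(
        "!BBHII", rtp_packet[:RTP_FIXED_HEADER_BYTES]
    )
    if flags >> 6 != 2:
        raise ValueError("unsupported RTP version")

    # The low four bits give the number of 32-bit contributing-source entries
    # that follow the fixed header.
    csrc_count = flags & 0x0F
    header_len = RTP_FIXED_HEADER_BYTES + 4 * csrc_count
    if len(rtp_packet) < header_len:
        raise ValueError("packet truncated inside the RTP header")

    # Only the compressed voice signal is passed on; the time stamp is kept
    # for later processing in the media plane.
    return time_stamp, rtp_packet[header_len:]
```

The payload is returned still compressed, which matches the intent that no decompression takes place at this stage.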

[0031] The compressed voice signal is subsequently 
processed by the energy detection and talker selection 
block 60 as depicted at steps 82 through 90. Firstly within
this processing, the block 60 determines if the compressed
voice signal contains speech at step 82 by performing
an energy detection operation. A compressed
voice signal containing speech indicates that the source
of the corresponding voice data packet has a participant
speaking locally.

[0032] This energy detection operation can be per- 
formed in a number of different manners. In one pre- 
ferred embodiment, a Voice Activity Detection (VAD) op- 
eration is enabled at the packet-based terminal that sent 
the voice data packet; the VAD operation alternatively 
being enabled at the packet-based network interface if 
the source of media signals is a non-packet-based tel- 
ephone terminal. In this preferred embodiment, packets 
(and therefore compressed voice signals) that can con- 
tain speech can be distinguished from packets that do 
not by the number of bytes contained within the packet. 
In other words, the size of the compressed voice signal 
can determine whether it contains speech. For example, 
in the case that the G.723.1 VoIP standard is utilized,
voice data packets containing voice would contain a 
compressed voice signal of 24 bytes while voice data 
packets containing essentially silence would contain a 
compressed voice signal of 4 bytes. 
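
The size-based test of this embodiment can be sketched as below, using the two sizes given in the example above. The function name `contains_speech` and the decision to reject any other size with an error are assumptions for illustration; a real bridge might need to accept further frame sizes.

```python
SPEECH_FRAME_BYTES = 24   # compressed signal carrying voice (from the example above)
SILENCE_FRAME_BYTES = 4   # compressed signal carrying essentially silence


def contains_speech(compressed_signal: bytes) -> bool:
    """Classify a compressed voice signal by its length only.

    No decompression is performed; this relies on VAD being enabled at
    the sending terminal or network interface.
    """
    size = len(compressed_signal)
    if size == SPEECH_FRAME_BYTES:
        return True
    if size == SILENCE_FRAME_BYTES:
        return False
    # Sizes other than the two named in the text are not covered here.
    raise ValueError(f"unexpected compressed signal size: {size} bytes")
```

The check costs a single length comparison per packet, which is why it suits a bridge that is trying to avoid signal processing on the compressed data.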
[0033] In another preferred embodiment, in which a 
VAD operation is not enabled at the packet-based ter- 
minal (or packet-based network interface) sending the 
voice data packet, the block 60 determines if there is 
speech within the compressed voice signal by monitor- 
ing a pitch-related sector within the corresponding voice 
data packet. For example, within the G.723.1 VoIP
standard, the pitch sector is an 18-bit field that contains
pitch lag information for all subframes. In this particular 
embodiment, the block 60 uses the pitch sector to gen- 
erate a pitch value for each subframe. If the pitch value 
is within a particular predetermined range, the corre- 
sponding compressed voice signal is said to contain 
speech. If not, the compressed voice signal is said to 
not contain speech. This predetermined range can be 
determined by experimentation or alternatively calculat- 
ed mathematically. It is noted that many current VoIP 
standard codecs include pitch information as part of the 
transmitted packet and a similar comparison of pitch val- 
ues with a predetermined range can be used with these 
standards. It is further noted that the energy determina- 
tion operations which determine whether a particular 
compressed voice signal contains speech should not be 
limited to the above described embodiments. 
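
The pitch-range comparison can be sketched as follows. This is a hedged sketch only: the function name, the way the per-subframe pitch values are obtained, and the rule that every subframe must fall within the range are all assumptions, since the text does not state how the subframe results are combined. The range limits are parameters because the text leaves them to experimentation or calculation.

```python
def contains_speech_by_pitch(pitch_values, low, high):
    """Decide whether a compressed voice signal contains speech.

    pitch_values: one pitch value per subframe, already derived from the
                  pitch sector of the voice data packet (derivation not shown).
    low, high:    the predetermined range (hypothetical values supplied by caller).
    """
    if not pitch_values:
        # No pitch information: treat as not containing speech (assumption).
        return False
    # Assumed combination rule: every subframe must lie within the range.
    return all(low <= pitch <= high for pitch in pitch_values)
```

For instance, with a hypothetical range of 40 to 120, the values `[60, 62, 61, 63]` would be classified as speech and `[20, 130, 18, 140]` would not.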
[0034] If the compressed voice signal at step 82 is 
deemed to not contain speech, the particular signal is 
discarded at step 83. The frequency with which signals are
discarded from a signal source based upon their lack
of speech affects the de-selection of talkers for the voice
conference as will be described herein below. If the compressed
voice signal at step 82 does contain speech,
the energy detection and talker selection block 60 pro- 
ceeds to determine at step 84 whether the compressed 
voice signal is from a packet-based terminal (more gen- 
erally a source of media data packets) selected to be a 
talker; voice signals from talkers being the only voice 
signals heard by the participants within the voice con- 
ference. 

[0035] The selection and de-selection of terminals as 
talkers is performed by a talker selection algorithm with- 
in the block 60. Although it is the terminal that is refer- 
enced as the source for the voice data packets contain- 
ing speech, for simplicity herein below, the description 
will refer to the talker selection algorithm determining 
which participants are speaking rather than referring to 
which terminals have participants that are speaking. It 
should be recognized that a reference to a participant 
speaking indicates that the voice data packet received 
from the terminal corresponding to the particular partic- 
ipant has been deemed to contain speech. 
[0036] There are three main situations, according to 
preferred embodiments, which would result in different 
operations for the talker selection algorithm, these situ- 
ations being no participants speaking, only one partici- 
pant speaking, and two or more participants speaking 
at once. For the first case in which there are no partici-
pants speaking, the talker selection algorithm preferably
has no terminals selected as talkers, thus preventing the 
sending of any voice data packets from the packet- 
based central conference bridge and further removing 
the need for any further processing to take place. Alter- 
natively, the talker selection algorithm could transmit 
empty voice data packets to the terminals within the 
voice conference when there are no talkers selected in 
order to maintain continuous packet transmission. 
[0037] When considering the second case in which 
only one participant is speaking, the talker selection al- 
gorithm preferably has only one terminal selected as a 
talker, that terminal being the one corresponding to the 
speaking participant. In this situation, the single talker 
is hereinafter referred to as a "lone talker". 
[0038] In the third case in which two or more partici- 
pants at different terminals are speaking at the same 
time, the talker selection algorithm preferably has one 
terminal selected as a "primary talker" and a second ter- 
minal selected as a "secondary talker" for the voice con- 
ference. When considering this situation, the talker se- 
lection algorithm, according to preferred embodiments, 
selects the primary and secondary talkers using a pre-
determined selection parameter. In one preferred em-
bodiment, this selection parameter is the order in which
the participants began to speak. In another embodi- 
ment, the selection parameter takes into consideration 
the volume level of the participants (i.e. comparing the 
energy levels of the talkers). In yet another embodiment, 
a control mechanism is in place that automatically se- 
lects a participant to be the primary or secondary talker. 
This control mechanism could be utilized in cases where
there is a moderator and/or a scheduled speaker for the
voice conference. 

[0039] The above described selection parameters are 
not meant to limit the scope of the present invention. In 
fact, the key to this portion of the preferable packet-
based central conference bridge is the selection of talk-
ers, while the parameter used for this selection and the
number of talkers selected are not directly relevant to the
present invention. 
[0040] Preferably the talker selection algorithm com-
prises a software algorithm that is continuously operat-
ing during a voice conference with the determination of
those speaking and the selection of no talkers, a lone 
talker, or primary and secondary talkers being dynamic 
during the receiving of voice data packets as will be de-
scribed with reference to steps 84 through 90. As well,
the talker selection algorithm preferably performs oper- 
ations to de-select talkers continuously during the voice 
conference. These de-selection operations preferably
include the steps of determining the length of time be-
tween voice data packets containing speech coming
from the talker(s) and de-selecting any talker if the 
length of time between voice data packets containing 
speech exceeds a threshold level. Of course, other de- 
selection techniques could be utilized as the actual de- 
selection operation being used is not critical to the 
present invention. 
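
One possible form of the de-selection operation described above is sketched below. This is an illustrative sketch under stated assumptions: the function name, the dictionary of last-speech times, and the use of "time since the last packet containing speech" as the measured interval are all hypothetical and are not taken from the description.

```python
from typing import Dict, List


def deselect_silent_talkers(last_speech_time: Dict[str, float],
                            now: float,
                            threshold: float) -> List[str]:
    """Remove and return talkers whose last speech packet is older than threshold.

    last_speech_time maps each selected talker to the arrival time (seconds)
    of its most recent voice data packet deemed to contain speech.
    """
    expired = [t for t, ts in last_speech_time.items() if now - ts > threshold]
    for t in expired:
        del last_speech_time[t]
    return expired
```

Calling the function periodically with the current time would de-select any talker that has been silent for longer than the threshold.
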

[0041] Referring back to FIGURE 5, the above de- 
scribed talker selection algorithm, for the case that the 
talker selection parameter is the order in which the par- 
ticipants begin to speak and a maximum of two talkers 
are selected at once, is implemented in steps 84 through 
90. As mentioned previously at step 84, the energy de- 
tection and talker selection block 60 determines if the 
compressed voice signal is from a participant selected 
as a talker. If the compressed signal is from a talker, the
talker selection algorithm determines, as depicted at 
step 85, if the talker is a lone talker, a primary talker, or 
a secondary talker. As will be described herein below, 
the output generator 70 processes the compressed 
voice signal differently depending on the "type" of talker 
it corresponds to. 

[0042] If, at step 84, the compressed voice signal 
does not correspond to a talker selected by the talker 
selection algorithm, the talker selection algorithm pro-
ceeds to determine if there are currently two talkers se-
lected at step 86. If there are two talkers already select-
ed, the compressed voice signal is discarded at step 83.
If there are not two talkers already selected at step 86, 
the talker selection algorithm determines if there is cur- 
rently a lone talker selected at step 87. If there is not a 
lone talker already selected at step 87, the talker selec- 
tion algorithm selects the participant corresponding to 
the particular compressed voice signal as the lone talker 
at step 88. If there is a lone talker currently selected at 
step 87, the talker selection algorithm proceeds to set 
the participant corresponding to the particular com-
pressed voice signal as the secondary talker at step 89
and to set the lone talker as the primary talker at step
90. The output generator 70, as described below, then
processes the compressed voice signal as if it were re-
ceived from the type of talker that its corresponding par-
ticipant is now set as.
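
Steps 84 through 90 can be sketched as a small state machine. This is an illustrative sketch for the case described above (selection by order of speaking, at most two talkers). The names `TalkerState` and `classify_speech_source` are hypothetical, and de-selection of talkers is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TalkerState:
    """Currently selected talkers; at most two are selected at once."""
    lone: Optional[str] = None
    primary: Optional[str] = None
    secondary: Optional[str] = None


def classify_speech_source(state: TalkerState, source: str) -> Optional[str]:
    """Return 'lone', 'primary' or 'secondary' for a signal containing speech,
    or None if the signal is to be discarded (step 83)."""
    # Steps 84 and 85: the source is already a selected talker.
    if source == state.lone:
        return "lone"
    if source == state.primary:
        return "primary"
    if source == state.secondary:
        return "secondary"
    # Step 86: two talkers are already selected, so the signal is discarded.
    if state.primary is not None and state.secondary is not None:
        return None
    # Steps 87 and 88: no lone talker yet, so this source becomes the lone talker.
    if state.lone is None:
        state.lone = source
        return "lone"
    # Steps 89 and 90: the new source becomes the secondary talker and the
    # former lone talker becomes the primary talker.
    state.primary, state.lone = state.lone, None
    state.secondary = source
    return "secondary"
```

The returned type would then determine how the output generator 70 handles the compressed voice signal.
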

[0043] The procedure that occurs within the output 
generator 70, according to the first preferred embodi- 
ment, if the compressed voice signal corresponds to one 
of a lone talker, a primary talker, and a secondary talker 
will now be described with reference to FIGURE 6. First-
ly, at step 94, if the compressed voice signal corre-
sponds to the secondary talker, the compressed voice
signal, hereinafter referred to as a secondary voice sig- 
nal, is initially encapsulated into a packet format suitable 
for transmission on a packet-based network and further 
transmitted to the primary talker via the packet-based 
network. Next, the output generator determines at step 96
whether a replacement for the secondary voice signal has
previously been generated, by monitoring the time stamp
associated with the secondary voice signal and comparing
it to the time stamps associated with previously received
secondary voice signals. If it is found that a replacement
was previously generated, the secondary voice
signal is discarded at step 98 and the conferencing con-
trol logic returns to step 80. If it is found that no replacement
has previously been generated, the secondary voice signal,
as depicted at step 100, is decompressed
(converting it into a decompressed voice signal
that is preferably a PCM signal) and preferably tempo- 
rarily saved within the output generator 70 in both com- 
pressed and decompressed formats. Alternatively, the 
secondary voice signal is saved within only one of the 
compressed and decompressed formats. Saving in only
the compressed format would result in the need for a
decompression operation at a subsequent step.
[0044] If it is determined that the compressed voice 
signal corresponds to the primary talker, the output gen-
erator 70, as shown at step 102, encapsulates the voice
signal, hereinafter referred to as the primary voice sig- 
nal, within a packet format satisfactory for transmission 
on a packet-based network and further transmits the re- 
sulting voice data packet to the secondary talker via the 
packet-based network. Subsequently, at step 104, it is 
determined whether there is a secondary voice signal 
currently saved within the output generator 70 with a 
corresponding time stamp. 

[0045] If there is no corresponding secondary voice
signal currently saved, it is determined at step 106
whether a predetermined time T has expired.
This predetermined time T is a waiting period in which 
the output generator 70 will not transmit the primary 
voice signal as the procedure returns to step 104. This 
compensates for minor delays caused in the network by 
providing the voice data packets arriving from the sec- 
ondary talker a limited amount of leeway after the arrival 
of a voice data packet corresponding to the primary talk-
er. Preferably, if no voice data packets arrive from the
secondary talker after the time T expires, the voice data
packets corresponding to the primary talker are not sub-
sequently delayed by this delay mechanism. If the pre-
determined time T has expired at step 106, a voice sig-
nal is generated for the secondary talker at step 108 with
the use of a well-known packet loss concealment algo-
rithm. This generated voice signal is an approximation
of what the secondary talker is saying based upon pre- 
vious secondary voice data packets that were received. 
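
The decision made at steps 104 through 108 can be sketched as a single function. This is an illustrative sketch only. The function name, the dictionary of saved secondary signals keyed by time stamp, the exact-match rule for "corresponding" time stamps, and the returned action labels are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def resolve_secondary(saved: Dict[int, bytes],
                      timestamp: int,
                      waited: float,
                      wait_limit: float) -> Tuple[str, Optional[bytes]]:
    """Decide how to handle a primary voice signal with the given time stamp.

    Returns ('use_saved', signal) if a matching secondary signal is saved
    (step 104), ('wait', None) while the waiting period T has not expired
    (step 106), and ('conceal', None) once T has expired so that a substitute
    signal must be generated by packet loss concealment (step 108).
    """
    if timestamp in saved:
        return "use_saved", saved[timestamp]
    if waited < wait_limit:
        return "wait", None
    return "conceal", None
```

Here `wait_limit` plays the role of the predetermined time T, and `waited` is the time elapsed since the primary voice signal arrived.
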
[0046] After the generation of a secondary voice sig-
nal at step 108 or if there was a corresponding second-
ary voice signal currently saved at step 104, a number
of operations, as depicted at step 110, are preferably
performed by the output generator 70 according to the
first preferred embodiment. These operations include
decompressing the compressed primary voice signal
(and secondary voice signal if previously not done),
hence converting it into an uncompressed voice signal
that is preferably a PCM signal; mixing the primary voice
signal with the secondary voice signal using a well-
known mixing algorithm as is currently used for combin-
ing two uncompressed voice signals such as PCM sig-
nals, the primary and secondary voice signals being
combined into a single uncompressed voice signal
(preferably a PCM signal); compressing the resulting
mixed voice signal; encapsulating the compressed
mixed voice signal within a packet format capable of
transmission on a packet-based network, this packet
format preferably including a new Real Time Protocol
(RTP) header with a time stamp; and transmitting the
resulting voice data packet containing the compressed
mixed voice signal to all the participants within the voice
conference with the exception of the primary and sec-
ondary talkers. The transmitting of the resulting voice
data packet preferably includes a unicast transmission
to each participant that is to receive the particular voice
data packet, a unicast transmission being a single trans-
mission that travels from point A to point B. In an alter-
native embodiment, a single multicast transmission is
sent in place of the plurality of unicast transmissions,
the multicast transmission including the mixed voice sig-
nal, the unmixed primary and secondary voice signals,
and an indication of which terminals should broadcast
which voice signals. In this alternative, steps 94 and 102
would be removed.
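
The mixing operation within step 110 is left to a well-known mixing algorithm. One common approach, sample-wise addition with saturation, is sketched below for illustration. It is an assumption of this sketch that the decompressed signals are equal-length sequences of 16-bit linear PCM samples. The function name and the choice of saturating addition are hypothetical and are not prescribed by the description.

```python
from typing import List, Sequence

PCM16_MIN, PCM16_MAX = -32768, 32767


def mix_pcm16(primary: Sequence[int], secondary: Sequence[int]) -> List[int]:
    """Mix two equal-length 16-bit linear PCM sample sequences.

    Each output sample is the sum of the two inputs, clipped to the
    16-bit range so that loud passages saturate rather than wrap around.
    """
    if len(primary) != len(secondary):
        raise ValueError("signals must have the same number of samples")
    return [max(PCM16_MIN, min(PCM16_MAX, a + b))
            for a, b in zip(primary, secondary)]
```

The mixed samples would then be compressed and encapsulated as described for step 110.
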

[0047] If the compressed voice signal was determined
to correspond to a lone talker, the output generator 70
preferably, as depicted at step 112, encapsulates the
compressed voice signal in a packet format suitable for
transmission on a packet-based network and subse-
quently transmits the voice data packet to all the partic-
ipants within the voice conference with the exception of
the lone talker. Similar to the description above, this
voice data packet would preferably be transmitted using
one or more unicast transmissions.

[0048] One of the keys to the packet-based central
conference bridge according to the first preferred em-
bodiment as described herein above is that the voice
data packets received from the primary talker drive the
transmission of the voice data packets containing the
mixed primary and secondary voice signals. This, along with
the operation of the jitter buffers within the packet-based
terminals as seen in FIGURE 3B (or alternatively within
the packet-based network interface), allows for the jitter
buffers shown within FIGURE 3A to be removed. The
functionality of these jitter buffers 38 within FIGURE 3A,
such as buffering to ensure smooth playback, is per-
formed by the jitter buffers within the packet-based ter-
minals.

[0049] The problem with out-of-order voice data pack- 
ets from the lone or primary talkers being received at 
the conference bridge can be dealt with in a number of 
ways without the use of a jitter buffer. It is noted that out- 
of-order voice data packets from the secondary talker 
are already compensated for within the procedure of 
FIGURE 6. Firstly, in cases where out-of-order packets are
not a significant problem, the conference bridge can dis-
card any voice data packet received from the primary
or lone talker if it arrives after a voice data packet
from the same talker that carries a later time stamp.
In an alternative embodiment to avoid out-of-order prob- 
lems within packets received from primary or lone talk- 
ers, the time stamp from the original primary or lone 
voice signal is included as the time stamp for the voice 
data packet containing the mixed voice signal, these
time stamps causing the jitter buffers within the termi- 
nals to compensate for the out-of-order packets of con- 
cern. Within further alternatives, a shallow jitter buffer 
could be implemented within the conference bridge to 
ensure the primary or lone voice signals are within the 
proper sequence. 
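
The first option above, discarding late packets from the primary or lone talker without a jitter buffer, can be sketched as follows. This is an illustrative sketch only. The function name and the per-talker dictionary are hypothetical, and time stamp wrap-around is deliberately ignored.

```python
from typing import Dict


def accept_in_order(last_timestamp: Dict[str, int],
                    talker: str,
                    timestamp: int) -> bool:
    """Return True and record the time stamp if the packet is not older than
    the newest packet already seen from this talker; otherwise return False,
    meaning the packet is to be discarded.

    Time stamp wrap-around is not handled in this sketch.
    """
    newest = last_timestamp.get(talker)
    if newest is not None and timestamp < newest:
        return False
    last_timestamp[talker] = timestamp
    return True
```

Only a single integer per talker is kept, so no buffering delay is added at the conference bridge.
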

[0050] FIGURE 7 is a logical block diagram illustrating 
the functionality of the packet-based central conference 
bridge according to the first preferred embodiment in the 
case that two or more participants are currently talking. 
As depicted in FIGURE 7, the conference bridge in this 
situation logically comprises a plurality of protocol 
stacks 52, a plurality of energy detection blocks 62 each 
coupled in series with a respective one of the protocol 
stacks 52, a talker selection block 64 coupled independ-
ently to each of the energy detection blocks 62, partici-
pant A and B transmitters 71, 72 independently coupled
to the talker selection block 64, two decompression 
blocks 74 independently coupled to the talker selection 
block 64, a mixer 76 coupled to each of the decompres- 
sion blocks 74, a compression block 78 coupled to the 
mixer 76, and a plurality of transmitters 79 coupled to 
the compression block 78. 

[0051] As can be seen in FIGURE 7, voice data pack- 
ets from each of the participants, participants A through 
Z in this case, are input to a respective protocol stack 
52. In this embodiment, these protocol stacks 52 are the 
only logical component within the packet receipt block 
50, as no jitter buffers similar to those within the well- 
known conference bridge depicted in FIGURE 3A are 
implemented. The protocol stacks 52 remove the packet 
overhead from the received voice data packets and out-
put voice signals in compressed format. In preferred em-
bodiments, the protocol stacks 52 together comprise a
single software algorithm that is run for each received
packet. In these preferred embodiments, the software
algorithm is possibly run multiple times in parallel as nu-
merous packets from different participants can be re-
ceived at one time.

[0052] In the logical block diagram of FIGURE 7 it can 
be seen that the compressed voice signal output from 
each of the protocol stacks 52 is subsequently received
by a corresponding energy detection block 62. These 
energy detection blocks 62 are preferably one of the log- 
ical components within the energy detection and talker 
selection block 60, with the energy detection blocks 62 
together comprising a single software algorithm that is
run for each compressed voice signal. It is determined 
for each of the voice signals within the received voice 
data packets whether the voice signal contains speech 
with use of the energy detection blocks 62, these deter- 
minations being forwarded to the talker selection block 
64. 

[0053] The talker selection block 64 preferably re- 
ceives the determinations of which of the received voice 
signals contain speech and, in the case of two or more 
speakers, determines who are the primary and secondary
talkers. FIGURE 7 depicts the case that there are at 
least two current talkers in the voice conference and the 
talker selection block 64 has selected participant A as 
the primary talker and participant B as the secondary 
talker. 

[0054] This results, within the output generator 70, in 
compressed voice signals from participant A being sent 
to the participant B transmitter 72 and one of the decom- 
pression blocks 74 while the compressed voice signals 
from participant B are sent to the participant A transmit-
ter 71 and the other decompression block 74. The trans-
mitters 71, 72 subsequently encapsulate the received
compressed voice signals into voice data packets, pref- 
erably including adding an RTP header with a times- 
tamp, and transmit the packets to the appropriate par- 
ticipants. Assuming that the compressed voice signal 
corresponding to participant B arrives within the prede-
termined time T of the voice signal corresponding to par-
ticipant A, the compressed voice signals of participants
A and B are decompressed such that they are preferably 
in PCM format, mixed together, compressed, and sub- 
sequently encapsulated and transmitted to the other 
participants within the voice conference (those being 
participants C through Z), the encapsulation similarly in- 
cluding an RTP header with a timestamp in preferred 
embodiments. It is noted that the transmitters 71, 72, 79
together preferably comprise a single transmitting algo- 
rithm that is run for each of the participants in the voice 
conference. 

[0055] Although the first preferred embodiment of the 
present invention is as described above with reference 
to FIGURES 4 through 7, this description is not meant 
to limit the scope of the present invention. Numerous
alternatives are possible such as the removal of the pre-
determined time T at step 106. This would result in the im-
mediate generation of a secondary voice signal in the
case that no such signal was previously saved. Further, 
although the first preferred embodiment describes the 
mixing of only the primary and secondary talkers, other 
embodiments could have the selection of more than two 
talkers and the subsequent mixing of all the selected 
voice signals. For such an alternative, a third talker 
could be selected which has its corresponding voice sig- 
nals mixed with the primary voice signals, the result be- 
ing sent to the secondary talker only, and mixed with the 
secondary voice signals, the result being sent to the pri- 
mary talker only. This alternative could allow a third talk- 
er to notify the primary and/or secondary talker that he/ 
she would like to speak. In this case, the other partici- 
pants in the conference call would not hear the third talk- 
er until one of the primary and secondary talkers ceased 
speaking so that they would be deselected as a talker.
[0056] There are numerous advantages to the pack- 
et-based central conference bridge according to the first 
preferred embodiment over the well-known conference 
bridge depicted in FIGURE 3A. The selection of talkers 
(no talkers, a lone talker, or primary and secondary talk- 
ers) prior to the decompression of the voice signals re- 
duces the required signal processing power and possi- 
bly the latency and transcoding for the overall confer- 
ence bridge. In the case that there are no talkers or only 
a lone talker, no decompression, mixing, or re-
compression is required within the design according to
the first preferred embodiment. If there are no talkers, 
no further processing after the talker selection algorithm 
is preferably performed. If there is only a lone talker, the 
compressed voice signal corresponding to the lone talk- 
er is simply encapsulated and transmitted to all the other 
participants within the voice conference with no trans- 
coding and hence better signal quality. In both of these 
cases, the required signal processing power is signifi- 
cantly reduced due to lack of decompression and re- 
compression and, for the case of the lone talker, the la- 
tency of the conference bridge is further reduced and 
the signal quality is improved. If there are two or more 
speakers, and hence primary and secondary talkers se- 
lected by the talker selection algorithm, the required sig- 
nal processing power can be reduced using the confer- 
ence bridge according to the first preferred embodiment. 
This reduction in required DSP power results from not 
being required to decompress all incoming voice sig- 
nals. In the conference bridge according to the first pre- 
ferred embodiment, only voice signals corresponding to 
the primary and secondary talkers are decompressed. 
Further, the primary and secondary voice signals which 
are directly sent to the secondary and primary talkers 
respectively have similar advantages to the lone talker 
described above. 

[0057] A further advantage of the first preferred em-
bodiment results since the design depicted in FIGURES
4 through 7 requires no jitter buffers. The jitter buffers
38 within the well-known conference bridge design of
FIGURE 3A increase the latency of the conference
bridge as well as the required signal process-
ing power for the overall conference bridge. With the de-
sign according to the first preferred embodiment no jitter
buffers are necessary, hence reducing the latency and
required signal processing power of the conference
bridge by the amounts attributable to the jitter buffers.
[0058] The packet-based central conference bridge
according to the second preferred embodiment of the
present invention will now be described with reference
to FIGURES 8 and 9. As stated previously, the use of
the conference bridge of the second preferred embodi-
ment requires modified packet-based terminals and/or
modified packet-based network interfaces to be used by
the participants. As such, a description of a packet-
based terminal and packet-based network interface ac-
cording to the second preferred embodiment with refer-
ence to FIGURES 10 and 11 will follow the description
of FIGURES 8 and 9.

[0059] The packet-based central conference bridge
according to the second preferred embodiment, as pre-
viously described, is consistent with the simplified block
diagram of FIGURE 4. Further, the operation of the
packet receipt block 50 and the energy detection and
talker selection block 60 as depicted in the flow chart of
FIGURE 5 is consistent with the operation of these
blocks within the conference bridge of the second pre-
ferred embodiment. The key difference between the
conferencing control logic for the first and second pre-
ferred embodiments relates to the operation of the out-
put generator 70, this difference being described herein
below.

[0060] The procedure that occurs within the output
generator 70, according to the second preferred embod-
iment, if the compressed voice signal corresponds to
one of a lone talker, a primary talker, and a secondary
talker will now be described with reference to FIGURE
8. The flow chart of FIGURE 8 is identical to the flow
chart of FIGURE 6 described herein above in detail with
the exception of steps 100 and 110. In other words,
steps 94 through 98, 102 through 108, and step 112 are
identical for both the first and second preferred embod-
iments.

[0061] In the case that a compressed secondary voice
signal is received at the output generator 70, the gener-
ator 70 proceeds through steps 94 and 96 as previously
described. If it is determined at step 96 that no replacement
for the secondary voice signal had previously been
generated, the voice signal
is temporarily saved within the output generator 70 at
step 114. The difference between step 100 (first pre-
ferred embodiment) and step 114 (second preferred em-
bodiment) is the lack of a decompression operation with-
in step 114. Once saved, the conferencing control logic
returns to step 80 of FIGURE 5.

[0062] In the case that a compressed primary voice
signal is received at the output generator 70, the gener-
ator proceeds through steps 102 through 108 as previ-
ously described. If there was a secondary voice signal
saved at step 104 or if a secondary voice signal was
generated at step 108, the output generator proceeds
through a number of operations as depicted at step 116.
These operations include both the compressed primary
and secondary voice signals being encapsulated within
a packet format suitable for transmission on a packet-
based network, this packet format preferably including
an RTP header with a time stamp, and the resulting
voice data packet(s) being transmitted to all the partic-
ipants within the voice conference with the exception of
the primary and secondary talkers. The encapsulation
of the primary and secondary voice signals preferably
entails placing the two signals within the same data sec-
tion of a single packet with no mixing. The bandwidth
efficiency of the voice communication system is in-
creased using this technique when compared to an al-
ternative in which the primary and secondary voice sig-
nals are transmitted in separate packets, each with its
own packet overhead. This
increase in bandwidth efficiency is due to the large pro-
portion of packet overhead bytes that are required within
a typical packet format. Hence, only requiring a single
packet overhead rather than two can significantly in-
crease the bandwidth efficiency. Similar to the transmis-
sion in the first preferred embodiment, the transmission
of these voice data packets is preferably a unicast trans-
mission corresponding to each participant that is to re-
ceive the voice data packet or alternatively could be a
single multicast transmission if the individual terminals
can determine whether they should broadcast only one of
the compressed voice signals (if the terminal is the pri-
mary or secondary talker) or both (if it is not the primary
or secondary talker).
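
The bandwidth argument above can be illustrated with a short calculation. This is a worked example under stated assumptions: the voice data packets are taken to be carried as RTP over UDP over IPv4 without options (40 bytes of headers per packet), each voice signal is taken to be one 24-byte G.723.1 frame at 6.3 kbit/s, and link-layer overhead is ignored. The transport stack, the frame size, and the function name are assumptions of this sketch.

```python
# Header sizes in bytes: IPv4 without options, UDP, and the fixed RTP header.
IPV4_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12
OVERHEAD = IPV4_HEADER + UDP_HEADER + RTP_HEADER  # 40 bytes per packet

# One G.723.1 frame at 6.3 kbit/s occupies 24 bytes.
FRAME = 24


def payload_efficiency(frames_per_packet: int, packets: int) -> float:
    """Fraction of transmitted bytes that carry voice payload."""
    payload = frames_per_packet * packets * FRAME
    total = payload + packets * OVERHEAD
    return payload / total


# Primary and secondary signals in one packet: 48 payload bytes of 88 sent.
combined = payload_efficiency(frames_per_packet=2, packets=1)
# Primary and secondary signals in separate packets: 48 of 128 bytes sent.
separate = payload_efficiency(frames_per_packet=1, packets=2)
```

Under these assumptions the single-packet approach carries roughly 55% payload, compared with 37.5% when the two signals are sent in separate packets.
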

[0063] In the case that a compressed voice signal 
from a lone talker is received at the output generator 70 
of the second preferred embodiment, the operation at 
step 112 is the same as previously described for the first
preferred embodiment. In this case, the voice signal is 
encapsulated and transmitted to all the participants in 
the voice conference with the exception of the lone talk- 
er, this transmission being either one or more unicast 
transmissions or alternatively a single multicast trans- 
mission. 

[0064] FIGURE 9 is a logical block diagram that illus- 
trates the functionality of the packet-based central con- 
ference bridge according to the second preferred em- 
bodiment in the case that the talker selection algorithm 
determines that there are two or more speakers and fur- 
ther selects primary and secondary talkers. In FIGURE 
9, the protocol stacks 52, energy detection blocks 62, 
and talker selection block 64 are identical to those de-
scribed herein above for FIGURE 7. The difference be-
tween FIGURES 7 and 9 resides within the output gen-
erator 70. Within FIGURE 9, the output generator 70 re-
ceives voice signals from a primary talker and a second-
ary talker, in this case participants A and B respectively.
As depicted in FIGURE 9, the output generator 70 sub-
sequently forwards the secondary voice signal to partic-
ipant A, the primary voice signal to participant B, and
both the primary and secondary voice signals to partici-
pants C through Z. Although not shown in FIGURE 9,
these voice signals are forwarded to the appropriate
participants by encapsulating the voice signals and
transmitting the resulting voice data packets to the ap-
propriate participant via a packet-based network.
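
The forwarding pattern of FIGURE 9 can be sketched as a simple routing function. This is an illustrative sketch only. The function name, the participant labels, and the representation of compressed voice signals as byte strings are hypothetical, and encapsulation and transmission are omitted.

```python
from typing import Dict, Iterable, List


def route_signals(participants: Iterable[str],
                  primary: str,
                  secondary: str,
                  primary_signal: bytes,
                  secondary_signal: bytes) -> Dict[str, List[bytes]]:
    """Map each participant to the compressed signals forwarded to it."""
    routes: Dict[str, List[bytes]] = {}
    for p in participants:
        if p == primary:
            # The primary talker hears only the secondary talker.
            routes[p] = [secondary_signal]
        elif p == secondary:
            # The secondary talker hears only the primary talker.
            routes[p] = [primary_signal]
        else:
            # All other participants receive both signals, unmixed.
            routes[p] = [primary_signal, secondary_signal]
    return routes
```

No decompression or mixing takes place in this routing, which matches the second preferred embodiment leaving those operations to the terminals or network interfaces.
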
[0065] There are numerous alternatives to the packet-
based central conference bridge according to the sec-
ond preferred embodiment. For one, step 106 in which
a primary voice signal is possibly delayed by a prede-
termined time T is removed in some embodiments, thus
resulting in the immediate generation of a secondary
voice signal in the case that there is no saved secondary
voice signal during the arrival of a primary voice signal.
Further, other alternative embodiments do not have the
option of generating secondary voice signals or sending
the primary and secondary signals within a single voice
data packet. In these embodiments, upon the arrival of
a primary voice signal, the output generator 70 simply
encapsulates the signal and transmits the resulting
voice data packet to all of the participants within the
voice conference except the primary talker. The same
operation is performed in the case that a secondary
voice packet arrives at the output generator 70 except
with the secondary talker being the only participant not
to receive the corresponding voice data packet.
[0066] Yet further alternative embodiments have
more than two participants selected as talkers, resulting
in voice signals corresponding to more than two talkers
being forwarded to the other participants within the voice
conference. In one such alternative, a third talker is se-
lected similar to that described for an alternative to the
first preferred embodiment.

[0067] A packet-based terminal and a packet-based
network interface that can operate with the packet-
based central conference bridge of the second preferred
embodiment are now described with reference to FIG-
URES 10 and 11. FIGURE 10 is a simplified block dia-
gram of a packet-based apparatus that can represent
either the packet-based terminal or packet-based net-
work interface according to the second preferred em-
bodiment, this apparatus comprising a packet receipt
block 120 and an output generator 130. FIGURE 11 is
a logical block diagram illustrating the packet-based ap-
paratus of FIGURE 10 in the case that a voice data pack-
et containing both primary and secondary voice signals
is received at the apparatus. In the case that a voice
data packet containing a voice signal from a lone talker
is received at the apparatus, a logical depiction of the
packet-based terminal and packet-based network inter-
face would be consistent with that depicted in FIGURE
3B for a well-known packet-based terminal and packet-
based network interface.

[0068] The packet receipt block 120 preferably re-
ceives a voice data packet containing one or two voice
signals (one voice signal if from a lone talker or two voice 
signals if from primary and secondary talkers) from the
packet-based central conference bridge of the second
preferred embodiment. The packet receipt block 120
performs a number of logical operations on the received
packets as can be seen in FIGURE 11 with respect to 
protocol stack 122 and jitter buffer 124. These blocks 
1 22,124 have similar functionality to that previously de- 
scribed for the protocol stack 47 and the jitter buffer 48 
respectively, both within FIGURE 3B, Hence, when re- 
ceiving voice data packets, the packet receipt block 1 20 
strips the packet overhead from the voice data packets, 
leaving only the compressed voice signals; ensures that 
the compressed voice signals of the primary and sec- 
ondary talkers are within the proper sequence (i.e. time 
ordering voice signals); buffers the compressed voice 
signals of the primary and secondary talkers to ensure 
smooth playback; and implements packet loss conceal- 
ment. The first operation is preferably performed by the 
protocol stack 122 while the last three operations are 
preferably performed by the jitter buffers 124. In FIG- 
URE 11, these jitter buffers 124 are logically depicted 
as two jitter buffers despite preferably consisting of a 
single algorithm being run for the compressed voice sig- 
nals of both the primary and secondary talkers. In fact, 
all of these operations are preferably algorithms running 
on one or more DSP devices, though alternatively they 
are performed by hard logic and/or discrete compo- 
nents. The end result of the operations within the packet 
receipt block 120 is the outputting of either one or two 
sets of compressed voice signals that are within the 
proper order. 
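
The receive-side operations described above can be illustrated with a minimal sketch. It is not the implementation of blocks 122 and 124; the packet layout (`VoicePacket`), the buffer depth, and the concealment strategy (repeating the last released frame) are assumptions chosen only to make the four steps concrete.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VoicePacket:
    """Hypothetical packet layout: overhead fields plus a compressed payload."""
    source: str      # original source address (overhead)
    seq: int         # packet sequence number (overhead)
    payload: bytes   # compressed voice signal


class JitterBuffer:
    """Reorders frames by sequence number and conceals losses by repetition."""

    def __init__(self, depth: int = 2) -> None:
        self.depth = depth                    # frames held back before playout
        self.frames: Dict[int, bytes] = {}    # seq -> compressed frame
        self.next_seq: Optional[int] = None   # next sequence number to release
        self.last: bytes = b""                # last released frame (for concealment)

    def push(self, pkt: VoicePacket) -> None:
        # Strip the overhead: only the payload is stored, keyed by its sequence number.
        if self.next_seq is None:
            self.next_seq = pkt.seq
        if pkt.seq >= self.next_seq:          # frames arriving too late are dropped
            self.frames[pkt.seq] = pkt.payload

    def pop(self) -> Optional[bytes]:
        # Hold frames back until `depth` later frames exist, to smooth playback.
        if self.next_seq is None or not self.frames:
            return None
        if max(self.frames) - self.next_seq < self.depth:
            return None
        frame = self.frames.pop(self.next_seq, None)
        if frame is None:
            frame = self.last                 # assumed concealment: repeat last frame
        self.last = frame
        self.next_seq += 1
        return frame
```

In this sketch one `JitterBuffer` instance would be kept per talker (primary and secondary), which mirrors the logical depiction of two jitter buffers even though a single algorithm is run for both.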

[0069] The output generator 130 preferably receives
these set(s) of compressed voice signals and processes
them so that an uncompressed set of voice signals is
sent to a speaker (not shown) in the case of the
packet-based apparatus being a packet-based terminal, or to a
non-packet-based telephone terminal (not shown) such
as a PCM terminal, via a non-packet-based telephone
network (not shown) such as a PCM telephone network,
in the case of the packet-based apparatus being a
packet-based network interface. As can be seen within FIGURE
11 for the case that two series of voice signals (primary
and secondary) are received, the output generator
130 logically comprises two decompression blocks 132
and a mixer 134. In this case, the output generator 130
operates to decompress the compressed primary and
secondary voice signals with decompression blocks
132, resulting in two streams of uncompressed voice
signals (preferably PCM voice signals). Subsequently,
these two streams of uncompressed voice signals are
mixed to generate a set of voice signals that is output.
Blocks 132, 134 are preferably algorithms being run on
one or more DSP devices, though alternatively they are
operations performed by hard logic and/or discrete
components. In the case that a single set of voice signals
corresponding to a lone talker is received at the output
generator, the voice signals are decompressed and
forwarded.
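
The decompress-then-mix path of the output generator can be sketched as follows. The codec is not specified here, so `placeholder_decode` is a stand-in that simply treats a frame as 16-bit PCM; the saturating sum in `mix` is likewise an assumed mixing rule rather than the one used by mixer 134.

```python
from array import array
from typing import Callable, Optional

Decoder = Callable[[bytes], array]


def placeholder_decode(frame: bytes) -> array:
    """Stand-in for a real codec: treats the frame as native-endian 16-bit PCM."""
    pcm = array("h")
    pcm.frombytes(frame)
    return pcm


def mix(a: array, b: array) -> array:
    """Sums two PCM streams sample by sample, saturating to the int16 range."""
    out = array("h")
    for i in range(max(len(a), len(b))):
        total = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
        out.append(max(-32768, min(32767, total)))
    return out


def generate_output(primary: bytes, secondary: Optional[bytes],
                    decode: Decoder = placeholder_decode) -> array:
    """Decompresses one or two compressed signals and returns the PCM to play."""
    first = decode(primary)
    if secondary is None:          # lone talker: decompress and forward
        return first
    return mix(first, decode(secondary))
```

Passing `secondary=None` models the lone-talker case, in which no mixing is performed.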

[0070] There are alternative embodiments to the
packet-based terminal and packet-based network interface
of FIGURES 10 and 11, most of which are based
on alternative embodiments to the packet-based central
conference bridge of the second preferred embodiment.
In one alternative embodiment, all of the voice data
packets being received by the packet receipt block
120 contain only one voice signal that corresponds to
one of a primary talker, a secondary talker, or a lone
talker. In this embodiment, an indication of the type of
talker to which the voice signal corresponds is preferably
included within the signal's packet overhead. Along with
this indication, a time stamp is preferably also included in
order to determine which primary and secondary voice
signals correspond and hence should be mixed. Alternatively,
another identification item could be used rather
than time stamps to determine which primary and secondary
voice signals should be mixed. Exemplary embodiments
of this alternative allow for primary or secondary
voice signals to be generated in cases where
they are not received at the packet-based terminal within
a predetermined time interval of their respective
secondary or primary voice signals.
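
One way to picture the time-stamp matching of this alternative is the sketch below. The field names in `TaggedSignal`, the 40 ms default for the predetermined interval, and the use of an empty `fill` payload as the generated substitute signal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TaggedSignal:
    role: str        # "primary", "secondary" or "lone" (hypothetical overhead field)
    timestamp: int   # time stamp carried in the packet overhead
    arrival: float   # local arrival time in seconds
    payload: bytes   # compressed voice signal


def pair_for_mixing(signals: List[TaggedSignal], now: float, wait: float = 0.04,
                    fill: bytes = b"") -> List[Tuple[int, bytes, Optional[bytes]]]:
    """Returns (timestamp, primary, secondary) entries that are ready for output."""
    by_ts: Dict[int, Dict[str, TaggedSignal]] = {}
    for sig in signals:
        by_ts.setdefault(sig.timestamp, {})[sig.role] = sig

    ready: List[Tuple[int, bytes, Optional[bytes]]] = []
    for ts in sorted(by_ts):
        roles = by_ts[ts]
        if "lone" in roles:                       # nothing to mix with
            ready.append((ts, roles["lone"].payload, None))
        elif "primary" in roles and "secondary" in roles:
            ready.append((ts, roles["primary"].payload, roles["secondary"].payload))
        else:
            present = next(iter(roles.values()))
            if now - present.arrival >= wait:     # partner is overdue: generate it
                if present.role == "primary":
                    ready.append((ts, present.payload, fill))
                else:
                    ready.append((ts, fill, present.payload))
    return ready
```

A signal whose partner has not yet arrived, and whose waiting interval has not expired, is simply left out of the result until a later call.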

[0071] In another alternative embodiment, the packet-based
apparatus of FIGURES 10 and 11 could be expanded
to receive and process more than just the voice
signals from two talkers. This is preferable in the case
that the packet-based apparatus is a packet-based network
interface. Therefore, the packet-based network interface
can operate as an interface between a packet-based
network and a non-packet-based telephone network
such as a PCM telephone network for a plurality
of non-packet-based telephone terminals such as PCM
telephone terminals.

[0072] There are numerous advantages to using the
packet-based central conference bridge and packet-based
apparatus according to the second preferred embodiment
within a voice conference. For one, advantages
similar to those stated above are found for the reduction
in latency and required signal processing power
with the removal of the jitter buffers within the conference
bridge. As well, some of the other advantages of
the first preferred embodiment also apply equally to the
second preferred embodiment, including the possible
reduction in latency, transcoding and required signal
processing power when selecting the talkers prior to
decompressing the voice signals.

[0073] The second preferred embodiment is essentially
the same as the first preferred embodiment except
with the mixing of the primary and secondary voice signals
being performed at the packet-based terminals
and/or packet-based network interfaces rather than at
the conference bridge. This change results in advantages
and disadvantages for the voice communication system
of the second preferred embodiment when compared
to the system of the first preferred embodiment.
One disadvantage with the moving of the mixing algorithm
is that a plurality of packet-based terminals and
packet-based network interfaces must perform the mixing
rather than one central DSP within the conference
bridge. Essentially, this will require an increase in the
required signal processing power within all of the
applicable packet-based terminals and packet-based
network interfaces.

[0074] One advantage of the voice communication 
system of the second preferred embodiment over the 
voice communication system of the first preferred em- 
bodiment is the removal of any need to decompress and 
then subsequently compress again, that being trans- 
coding as described previously. Decompression of the 
voice signals, as depicted in FIGURE 7, is required prior 
to the mixing of the voice signals and compression is 
required prior to the transmission. In the conference 
bridge and terminal/network interface according to the 
second preferred embodiment, there is only one required
decompression operation (that being at the
terminals/network interfaces) and zero required compression
operations. On the other hand, within the similar
apparatuses of the first preferred embodiment, two
decompression stages and one compression stage are
necessary. This reduction in transcoding can directly lead
to an increase in signal quality and a decrease in latency.

[0075] The overall effect of the above described lack
of decompression and compression operations and the
removal of the mixing operation is that the central
conference bridge according to the second preferred
embodiment requires fewer computational resources
and therefore has increased capacity in terms of ports. The
simplicity of the conference bridge makes it more amenable
to general purpose microprocessor devices, reducing 
the need for highly specialized DSPs that add significant 
costs. Therefore, the central conference bridge accord- 
ing to the second preferred embodiment does not have 
to be a specially designed apparatus but could be im- 
plemented within any device containing a microcontrol- 
ler capable of running software operations, such as a 
server, a call processor, a router, or an end user person- 
al computer. 

[0076] Some of the key advantages of the second pre- 
ferred embodiment relate to the possibility of making the 
packet-based central conference bridge relatively sim- 
ple by moving the mixing operation to the packet-based 
terminals and/or packet-based network interfaces. This
reduction in complexity within the conference bridges
can allow for increased flexibility and additional
operations in the use of these apparatuses.
[0077] One such additional operation concerns interlocking
a plurality of conference bridges, as will now be
described with reference to FIGURES 12A through 12C.
In these figures, first, second, and third packet-based
central conference bridges 140, 142, 144 according to
the second preferred embodiment are illustrated, each
of the conference bridges being coupled to both of the
other conference bridges. As depicted in FIGURE 12A,
each conference bridge 140, 142, 144 receives voice data
packets corresponding to a subset of all the participants
in a voice conference. In the case shown, the first,
second, and third conference bridges 140, 142, 144 receive
voice data packets from participants A through H,
I through Q, and R through Z respectively. Further, each
conference bridge also receives voice data packets
corresponding to the primary and secondary talkers selected
by the other conference bridges, these voice data
packets containing the original source address of the
participant. This setup potentially allows for a plurality
of identical packets from a primary or secondary talker
to arrive from different sources. In this case, the packets
with the earliest arrival are preferably utilized, with the
later identical packets being discarded. Whether two
packets are identical is preferably determined by a
combination of the source address (which as stated above is
maintained within packets being forwarded from one
conference bridge to another) and the packet sequence
number or a time stamp within the packet such
as the RTP time stamp.
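
The earliest-arrival rule can be sketched in a few lines. The class name `DuplicateFilter` and the choice of (source address, sequence number) as the identity key are illustrative; a time stamp could be substituted for the sequence number. The unbounded `seen` set is a simplification, since a practical version would expire old entries.

```python
from typing import Set, Tuple


class DuplicateFilter:
    """Keeps the first copy of each packet and rejects later identical copies."""

    def __init__(self) -> None:
        # Identity of a packet: original source address plus sequence number.
        self.seen: Set[Tuple[str, int]] = set()

    def accept(self, source: str, seq: int) -> bool:
        """Returns True for the earliest arrival of a packet, False for duplicates."""
        key = (source, seq)
        if key in self.seen:
            return False          # a copy already arrived by a faster route
        self.seen.add(key)
        return True


# Usage: arrivals are processed in the order they reach the bridge, so the
# first copy seen is by construction the earliest one.
arrivals = [
    ("A", 7, "direct"),
    ("I", 3, "bridge 142"),
    ("A", 7, "bridge 142"),   # duplicate of A/7, discarded
    ("I", 3, "bridge 144"),   # duplicate of I/3, discarded
    ("A", 8, "direct"),
]
dedup = DuplicateFilter()
kept = [a for a in arrivals if dedup.accept(a[0], a[1])]
```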

[0078] As depicted in FIGURE 12A, the first conference
bridge 140 receives voice data packets corresponding
to participants A through H and the voice data
packets corresponding to the primary and secondary
talkers selected by the second and third conference
bridges 142, 144. With all the received voice data packets
(including those from the other conference bridges),
each conference bridge removes all late arriving duplicate
packets, as described above, and performs an energy
detection and talker selection operation, as described
previously with reference to block 60, on the remaining
packets. A change in the selected primary and secondary
talkers at one conference bridge will change the
voice data packets received at the other conference
bridges, hence possibly changing the selection of talkers
generated by the other conference bridges. In the
case that all conference bridges have the same talker
selection criteria, all of the conference bridges in
equilibrium should have the same primary and secondary
talkers selected.

[0079] As depicted in FIGURE 12A, all three of the
conference bridges 140, 142, 144 have selected participant
A and participant I as the primary and secondary
talkers respectively. This results in the first conference
bridge 140 receiving the voice data packets of participant
A from three different sources, those being direct
from participant A and from both the second and third
conference bridges 142, 144, and receiving the voice data
packets of participant I from two different sources,
those being both the second and third conference bridges
142, 144. As described previously, the first conference
bridge 140 will utilize (for forwarding purposes) the
packets being received from the best source, that being
the source by which the packets arrive first. In the case
shown in FIGURE 12A, the first conference bridge 140
uses the packets of participant A directly received from
participant A and the packets of participant I received
from the second conference bridge 142. Alternatively,
the packets of participant I received from the third
conference bridge 144 could arrive first due to a problem,
such as congestion, delaying the packets sent directly
from the second conference bridge 142. In this case, the
packets of participant I being sent from the second
conference bridge 142 via the third conference bridge 144
would be utilized by the first conference bridge 140. It
can be further seen in FIGURE 12A that the second and
third conference bridges 142, 144 similarly select between
identical packets (as determined by the source
address and packet sequence numbers) from multiple
sources when determining which packets to forward to
the participants directly coupled to the particular
conference bridge and further to forward to the other
conference bridges. As with the first conference bridge 140,
these conference bridges 142, 144 select the packets
with the earliest arrival. This ability to compensate for
delays within the packet-based networks is one of the
key advantages of this implementation.
[0080] FIGURES 12B and 12C illustrate the network
of interlocking conference bridges of FIGURE 12A, but
during a change of secondary talkers. In FIGURE 12B,
the first conference bridge 140 is still receiving all of the
signals described previously for FIGURE 12A, but the
talker selection operation within the first conference
bridge has changed its selection concerning the secondary
talker. Now, it has selected participant B as the
secondary talker instead of participant I. The primary
talker selection stays the same in this example. As
depicted in FIGURE 12B, the first conference bridge 140
begins to transmit the voice data packets of participant
B to the other conference bridges 142, 144, but the other
conference bridges at this point still have participant I
selected as the secondary talker. If the other conference
bridges 142, 144 utilize the same selection criteria as the
first conference bridge 140, the other conference bridges
142, 144 will eventually select participant B as the
secondary talker, as is depicted in FIGURE 12C.
This will return the system to equilibrium, in which all of
the participants in the voice conference can hear the
same talkers.
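
The way interlocked bridges settle on a common selection can be illustrated with a small round-based simulation. This is only a sketch: the selection criterion (the two highest energy values, with the louder one taken as primary), the round structure, and the bridge and participant labels are assumptions, not the talker selection operation of block 60.

```python
from typing import Dict, List, Tuple

Selection = Dict[str, Tuple[str, ...]]


def select_talkers(energy: Dict[str, float], count: int = 2) -> Tuple[str, ...]:
    """Picks the `count` loudest candidates; ties are broken by name."""
    ranked = sorted(energy, key=lambda p: (-energy[p], p))
    return tuple(ranked[:count])


def run_until_equilibrium(local: Dict[str, Dict[str, float]],
                          max_rounds: int = 10) -> List[Selection]:
    """`local` maps each bridge to the energy of its directly coupled participants."""
    all_energy = {p: e for members in local.values() for p, e in members.items()}
    # Round 0: each bridge has only heard its own participants.
    selected: Selection = {b: select_talkers(m) for b, m in local.items()}
    history = [selected]
    for _ in range(max_rounds):
        nxt: Selection = {}
        for bridge, members in local.items():
            candidates = dict(members)
            # Add the talkers that the other bridges forwarded in the last round.
            for other, talkers in selected.items():
                if other != bridge:
                    for talker in talkers:
                        candidates[talker] = all_energy[talker]
            nxt[bridge] = select_talkers(candidates)
        if nxt == selected:        # no bridge changed its selection: equilibrium
            break
        selected = nxt
        history.append(selected)
    return history
```

With identical selection criteria at every bridge, the last entry of the returned history holds the same pair of talkers for all bridges. Raising participant B's energy above participant I's and rerunning reproduces the change of secondary talker described for FIGURES 12B and 12C.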
[0081] There are a number of advantages to the
interlocked conference bridge configuration of FIGURES
12A through 12C. One key advantage, as stated previously,
is the ability of this configuration to compensate
for delays in the packet-based network being used. This
ability arises from the possibility of conference bridges
receiving identical packets from a plurality of sources
and being able to select between them, preferably based
upon the earliest arrival.

[0082] Another key advantage that could occur with
the use of interlocked conference bridges is a reduction
in bandwidth requirements within the packet-based network
when establishing voice conferences between participants
in dispersed locations. In traditional conference
bridges such as the one depicted in FIGURE 3A, the
voice packets corresponding to all of the participants
must arrive at a single conference bridge. Using interlocked
conference bridges of the second preferred embodiment,
the participants within a voice conference can
be divided into a plurality of sets of participants, each
set of participants being coupled to a different conference
bridge. The only communications between these
interlocked conference bridges are with respect to packets
from selected primary and secondary talkers. The
advantage can be understood best by example. In the
case depicted in FIGURE 12A, if the participants A
through H were based in Ottawa, Canada, participants
I through Q were based in Santa Clara, California, and
the participants R through Z were based in Richardson,
Texas, the conference bridges 140, 142, 144 could be
based in Ottawa, Santa Clara, and Richardson respectively.
The only communications between these dispersed
cities would be with regard to the selected primary
and secondary talkers. In previous implementations, the
packets of all of these participants A through Z would
have to be sent to a single conference bridge.
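
The Ottawa, Santa Clara and Richardson example can be turned into a rough count of inbound voice streams that must cross between cities. The sketch assumes that every participant transmits continuously, that the single bridge of the earlier design sits in Ottawa, and that each interlocked bridge sends two talker streams to every other bridge; streams sent back out to listeners are not counted. The interlocked figure is therefore an upper bound that does not depend on the number of participants.

```python
from typing import Dict


def centralized_intercity_streams(sites: Dict[str, int], bridge_site: str) -> int:
    """Streams that leave their own city to reach one central bridge."""
    return sum(count for site, count in sites.items() if site != bridge_site)


def interlocked_intercity_streams(sites: Dict[str, int], talkers: int = 2) -> int:
    """Talker streams exchanged when every city hosts one interlocked bridge."""
    bridges = len(sites)
    return bridges * (bridges - 1) * talkers


# Participants A-H (8), I-Q (9) and R-Z (9), as in FIGURE 12A.
sites = {"Ottawa": 8, "Santa Clara": 9, "Richardson": 9}
central = centralized_intercity_streams(sites, "Ottawa")   # 9 + 9 = 18
interlocked = interlocked_intercity_streams(sites)         # 3 * 2 * 2 = 12
```

Under these assumptions the centralized count grows with every remote participant, whereas the interlocked count stays at 12 however many participants join in each city.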
[0083] It is noted that it would not be possible for
previous conference bridge designs, such as that depicted
in FIGURE 3A, to be implemented in an interlocked
configuration. Firstly, the conference bridges within an
interlocked design must not mix the voice signals
corresponding to the primary and secondary talkers, since
this would not allow the other conference bridges to
independently compare the primary and secondary talkers
received from other conference bridges. Further, the
latency associated with traversing one of these previously
designed conference bridges typically results in
unacceptable delays. In the interlocked design, voice data
packets often traverse at least two conference bridges
prior to being sent to a listener within the voice
conference, hence increasing even further the problem of
latency within these previous designs. The latency is not
a critical problem using the conference bridges of the
second preferred embodiment within an interlocked
configuration because of the relatively low latency
associated with each of the conference bridges independently.

[0084] Although the interlocked conference bridge 
configuration in FIGURES 12A through 12C depicts 
three conference bridges that have selected primary 
and secondary talkers, this is not meant to limit the 
scope of the present invention. For instance, more con- 
ference bridges or as few as two conference bridges 
could be interlocked. As well, in the case of a lone talker 
being selected within the voice conference, it should be 
understood that only a single packet would be sent to 
the other interlocked conference bridges. Further, it 
should be understood that the interlocked configuration 
could be used by conference bridges that select more 
than two talkers, with all of the packets associated with 
selected talkers being forwarded to the other conference
bridges of the voice conference. Yet further, the
conference bridge does not necessarily have to receive
voice data packets from individual packet-based terminals
or packet-based network interfaces but could receive
only the voice data packets of talkers selected by other
conference bridges.

[0085] There are large numbers of yet further possible 
alternative embodiments to the interlocked configura- 
tion described herein above, many of which have yet 
further additional advantages. One such alternative has 
the conference bridges prevent the reforwarding of iden- 
tical packets back to the best (earliest arriving) source 
of the particular voice data packets. Hence, if a confer- 
ence bridge has voice data packets arriving from anoth- 
er interlocked conference bridge which are subsequent- 
ly selected as the earliest arriving packets correspond- 
ing to the primary or secondary talker, the particular 
packets are not forwarded back to the conference bridge 
source in this alternative. This alternative effectively
reduces the number of voice data packets being exchanged
between the conference bridges, hence decreasing
the load on the packet-based network.
[0086] Another alternative embodiment of the interlocked
configuration depicted in FIGURE 12A has at
least one of the interlocked conference bridges sending
the voice data packets corresponding to its selected primary
and secondary talkers to fewer than all of the other
interlocked conference bridges. This setup can be used
to reduce the number of packets traversing a
bandwidth-sensitive link such as a link across the Atlantic. For
instance, a voice conference of 100 participants in North
America could be connected to a voice conference of 
100 participants in Europe with the only connection be- 
ing between two conference bridges that are further in- 
terlocked with a plurality of other conference bridges in 
their respective continents. This can result in the ex- 
changing of voice data packets corresponding to their 
respective selected primary and secondary talkers be- 
ing the only required transmissions over the Atlantic link. 
The trade-off to this configuration is possibly a slight in- 
crease in the latency due to some voice data packets 
possibly having to traverse more conference bridges to 
reach all of the other conference bridges in the voice 
conference. 

[0087] Yet further, other alternative embodiments to
the interlocked conference bridge configuration of FIGURES
12A through 12C have the conference bridges
interconnected within a large variety of configurations
rather than a loop. In one case, the conference bridges
are coupled in series, with each conference bridge
forwarding the voice data packets corresponding to its
determined primary and secondary talkers to the conference
bridge on either side of it. In another alternative
configuration, the conference bridges are coupled to a
central conference bridge so that the conference bridges
essentially form a star. A large number of other
configurations can be considered, with the key consideration
being the latency that would result if the primary
or secondary talkers were a large number of "hops" from
other conference bridges within the interconnected
network. It is noted that in preferred embodiments, the
latency problem is not significant until a voice data packet
must traverse a large number of conference bridges.
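
The "hops" consideration can be made concrete by computing, for a given interconnection, the largest number of bridge-to-bridge links a talker's packets might have to cross. The sketch below uses a breadth-first search over an adjacency map; the helper names and the bridge labels are hypothetical, and link delay is ignored so that only the hop count is compared.

```python
from collections import deque
from typing import Dict, List


def max_hops(adj: Dict[str, List[str]]) -> int:
    """Largest number of links between any two bridges (graph diameter)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                       # breadth-first search from `start`
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        worst = max(worst, max(dist.values()))
    return worst


def full_mesh(names: List[str]) -> Dict[str, List[str]]:
    """Every bridge coupled to every other bridge, as in FIGURE 12A."""
    return {n: [m for m in names if m != n] for n in names}


def series(names: List[str]) -> Dict[str, List[str]]:
    """Bridges coupled in a chain, each to the bridge on either side of it."""
    return {n: [names[j] for j in (i - 1, i + 1) if 0 <= j < len(names)]
            for i, n in enumerate(names)}


def star(hub: str, leaves: List[str]) -> Dict[str, List[str]]:
    """All bridges coupled to one central bridge."""
    adj = {hub: list(leaves)}
    adj.update({leaf: [hub] for leaf in leaves})
    return adj
```

For five bridges these helpers give one link for the full mesh, two for the star and four for the series arrangement, which is the kind of comparison that would inform the choice of configuration.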



[0088] Another additional operation that is possible
with the use of conference bridges according to the second
preferred embodiment is the defining of all packet-based
voice communications as a conference session,
whether there are two participants or hundreds. In this
design, all voice data packets within a packet-based network
traverse a conference bridge, with each participant
treated independently at the conference bridge. This allows
each packet-based voice session, whether point-to-point
or a conference situation, to have a control
mechanism operated with the use of conference bridges.
This can allow for additional functionality within the
control plane of a typical telephone session, such as
allowing participants to join the telephone session without
having to be initiated by a current participant, essentially
giving the initiation control to a new participant. This is
useful for people who wish to make a quick comment to
one of the participants or for people who wish to join the
conference session while it is in progress. For instance,
one participant in a conference session could suggest
to another person to join the conference session when
he/she gets a chance; the person in this case is able to
join at his/her will without disturbing the other participants.
Additionally, the flexibility of the second preferred
embodiment allows for a voice conference to expand
from a point-to-point voice communication to a larger
conference session with ease, as every packet-based
voice communication is easily scalable in this setup.
[0089] Yet another additional operation that is possible
with the use of packet-based terminals or packet-based
network interfaces of the second preferred embodiment
is the ability to perform three way voice conferencing
without the use of a central conference bridge.
In the case of three participants within a voice conference,
the central conference bridge of the second preferred
embodiment can be seen to be performing an unnecessary
function, since the selection of talkers is not
necessary in the case that the packet-based terminals
and/or packet-based network interfaces can mix the
voice signals from two sources, that being the maximum
number of sources that the apparatus could possibly
receive voice data packets from at one time if only three
participants are in the voice conference.
[0090] Overall, the present invention as described
herein above has considerable advantages over the
well-known voice conferencing techniques. These embodiments
as described allow for the operations within
the central conference bridge to have decreased latency,
decreased computational requirements, and an increased
signal quality due to a reduction in transcoding.
[0091] There are a number of features that can be
added to any one of the above embodiments of the
present invention that have not previously been discussed
in detail. For one, a modified control plane is
used such that a number of operations could be controlled
with the transmission of control packets between
participants and possibly a moderator. One such operation
could have a moderator established as a permanent
talker throughout the voice conference, possibly as
a permanent secondary talker or possibly as a third
selected talker. Another operation that could be controlled
through use of a modified control plane is the manual 
selection of primary and/or secondary talkers. This may 
be useful in cases where a particular participant is 
scheduled to speak. Yet another possible operation that 
could be maintained with use of a modified control plane 
is a sidebar operation. In a sidebar operation, at least 
two of the participants within a voice conference can 
form a subset of participants smaller than the set that 
defines the entire voice conference. With this setup, one 
participant within the subset can choose to communi- 
cate with the entire voice conference or with only the 
members of the subset. 

[0092] Another feature that could be added to any one
of the embodiments of the present invention described
herein above is the sending of video streams via video
data packets within the packet-based network. In these
embodiments, the video data packets would replace or
supplement the voice data packets within the above
described implementations. Embodiments with this
feature would operate the same as described
herein above, with these video signals preferably
corresponding to the primary talker. Alternatively, a
manual control within the control plane could be added 
so that each participant or a moderator could select 
which video stream to view. Further, a picture-in-picture 
feature could be used such that two or more video 
streams could be shown at once. In the case of there 
being primary and secondary talkers, the picture-in-pic- 
ture operation could be equivalent to the mixing of the 
corresponding voice signals. 

[0093] In general, although the operation of the
present invention was described herein above with use 
of the terms voice data packets and voice signals, these 
packets and signals can be referred to broadly as media 
data packets and media signals respectively. In this 
case, media data packets are any data packets that are 
transmitted via the media plane, these media data pack- 
ets preferably being either audio or audio/video data 
packets. It is noted that use of the term voice data pack- 
ets above is specific to the preferred embodiments in 
which the audio signals are voice. Further, it should be 
understood that video data packets may incorporate au- 
dio data packets. 

[0094] Although the present invention as described
herein above has a single voice conference being established
with the use of a central conference bridge, it
should be understood that the central conference bridge 
would preferably be capable of handling a plurality of 
voice conferences simultaneously. 
[0095] Persons skilled in the art will appreciate that 
there are yet more alternative implementations and 
modifications possible for implementing the present in- 
vention, and that the above implementation is only an 
illustration of this embodiment of the invention. The 
scope of the invention, therefore, is only to be limited by
the claims appended hereto.

Claims

1. A conference bridge, comprising:

means for receiving at least one media data
packet from at least two sources forming a media
conference, each media data packet defining
a compressed media signal;
means for determining at least one speech parameter
corresponding to each of the compressed
media signals; and
means for selecting a set of the sources within
the media conference as talkers based on the
determined speech parameters.

2. A conference bridge according to claim 1, wherein
the media data packets are audio data packets and
the compressed media signals defined by the media
data packets are compressed audio signals.

3. A conference bridge according to claim 2, wherein
the speech parameter corresponding to each of the
compressed media signals is a number of bytes
within each of the compressed media signals.

4. A conference bridge according to claim 2, wherein
the speech parameter corresponding to each of the
compressed media signals is a pitch value within
each of the corresponding media data packets.

5. A conference bridge according to claim 2, wherein
the speech parameter corresponding to each of the
compressed media signals is an energy level
corresponding to each of the compressed media signals.

6. A conference bridge according to claim 1, wherein
the media data packets are audio/video data packets
and the compressed media signals defined by
the media data packets are compressed audio/video
signals.

7. A conference bridge according to any one of claims
1 to 6, wherein the means for selecting a set of the
sources within the media conference as talkers
comprises:

means for determining whether each of the received
compressed media signals contains
speech based on the corresponding speech
parameters;
means for determining whether each of the
compressed media signals containing speech
correspond to a previously selected talker;
means for determining whether a maximum
number of talkers parameter is met;
means for discarding each of the compressed
media signals containing speech that do not
correspond to a previously selected talker in the
case that the maximum number of talkers parameter
is met; and
means for selecting as a talker within the media
conference a source corresponding to each of
the compressed media signals containing
speech that do not correspond to a previously
selected talker in the case that the maximum
number of talkers parameter is not met.
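
For illustration only, the selection steps recited in claim 7 can be sketched as a single pass over the received signals. The function and parameter names are hypothetical, the speech test is supplied by the caller (for example a threshold on an energy level), and the release of a talker who stops speaking is not modelled.

```python
from typing import Callable, Iterable, List, Set, Tuple


def select_talkers(signals: Iterable[Tuple[str, float]],
                   current_talkers: Set[str],
                   max_talkers: int,
                   contains_speech: Callable[[float], bool]
                   ) -> Tuple[Set[str], List[str]]:
    """Returns the updated talker set and the sources whose signals are kept.

    `signals` holds (source, speech_parameter) pairs for one processing interval.
    """
    talkers = set(current_talkers)
    kept: List[str] = []
    for source, parameter in signals:
        if not contains_speech(parameter):
            continue                         # no speech: not a talker candidate
        if source in talkers:
            kept.append(source)              # previously selected talker
        elif len(talkers) >= max_talkers:
            continue                         # maximum met: signal is discarded
        else:
            talkers.add(source)              # maximum not met: select as talker
            kept.append(source)
    return talkers, kept


# Usage with an assumed energy threshold of 0.1 as the speech test.
talkers, kept = select_talkers(
    [("A", 0.9), ("B", 0.0), ("C", 0.5), ("D", 0.7)],
    current_talkers={"A"},
    max_talkers=2,
    contains_speech=lambda energy: energy > 0.1,
)
```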

8. A conference bridge according to any one of claims
1 to 7 further comprising means for outputting media
data packets that correspond to the set of the
sources selected as talkers within the media conference.

9. A conference bridge according to claim 8, wherein
the means for outputting comprises:

means for determining whether each of the received
compressed media signals correspond
to a talker within the media conference; and
means for outputting each of the received compressed
media signals that correspond to a
talker to the sources within the media conference
except the source corresponding to the
received compressed media signal.



12. A conference bridge according to any one of claims
10 and 11, wherein the means for processing comprises
means for determining at least one speech
parameter associated with each of the compressed 
media signals and means for selecting a set of the 
sources within the media conference as talkers 
based upon the determined speech parameters. 

13. A conference bridge according to any one of claims 
8 and 1 0 to 1 2, wherein the set of the sources within 
the media conference selected as talkers compris- 
es one of first and second sources selected within 
the media conference as primary and secondary 
talkers respectively, one of the sources selected 
within the media conference as a lone talker, and 
none of the sources selected within the media con- 
ference as a talker. 

14. A conference bridge according to claim 13, wherein 
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond 
to the lone talker within the media conference; 
and 

means for outputting each of the received com- 
pressed media signals that correspond to the 
lone talker to the sources within the media con- 
ference except the source corresponding to the 
compressed media signal. 



10. A conference bridge, comprising: 

means for receiving at least one media data 
packet from at least two sources forming a me- 35 
dia conference, each media data packet defin- 
ing a compressed media signal; 
means for processing the received com- 
pressed media signals including means for se- 
lecting a set of the sources within the media 40 
conference as talkers, one of the talkers being 
a lead talker; and 

means for outputting media data packets that 
correspond to the lead talker always in the 
same order as the media data packets which 
are received from the lead talker. 

1 1 . A conference bridge according to claim 1 0, wherein 
each of the media data packets received from the 
sources comprises a time stamp; and 50 

wherein the means for receiving comprises 
means for saving the time stamps corresponding to 
the media data packets received from the lead talk- 
er and the means for outputting comprises means 
for inserting the saved time stamps within the cor- ss 
responding media data packets output from the 
conference bridge. 



15. A conference bridge according to claim 13, wherein
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond 
to one of the primary and secondary talkers 
within the media conference; and 
means for outputting each of the received com- 
pressed media signals that correspond to one 
of the primary and secondary talkers to the 
sources within the media conference except 
the source corresponding to the compressed 
media signal. 

16. A conference bridge according to claim 13, wherein
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond 
to the secondary talker within the media con- 
ference; and 

means for outputting each of the received com- 
pressed media signals that correspond to the 
secondary talker to the primary talker within the 
media conference; 

means for determining whether each of the received
compressed media signals that correspond to the
secondary talker have been generated for previously;
means for saving each of the received compressed
media signals that correspond to the secondary talker
and have not been previously generated for; and
means for discarding each of the received com- 
pressed media signals that correspond to the 
secondary talker and have been previously 
generated for. 

17. A conference bridge according to claim 13, wherein
the means for outputting comprises: 

means for determining whether each of the received
compressed media signals correspond
to the secondary talker within the media con- 
ference; 

means for outputting each of the received com- 
pressed media signals that correspond to the 
secondary talker to the primary talker within the
media conference; 

means for determining whether each of the received
compressed media signals that correspond to the
secondary talker have been generated for previously;
means for decompressing and saving each of the
received compressed media signals, resulting in
secondary media signals, that correspond to the
secondary talker and have not previously been
generated for; and
means for discarding each of the received com- 
pressed media signals that correspond to the 
secondary talker and have been previously 
generated for. 

18. A conference bridge according to claim 13, wherein
the means for outputting comprises: 

means for determining whether each of the received
compressed media signals correspond
to the primary talker within the media confer- 
ence; 

means for outputting each of the received com- 
pressed media signals that correspond to the 
primary talker to the secondary talker within the
media conference; 

means for decompressing each of the received 
compressed media signals that correspond to 
the primary talker, resulting in primary media 
signals;
means for determining for each primary media 
signal whether a corresponding secondary me- 
dia signal is saved; 

means for generating a corresponding secondary
media signal if a corresponding secondary
media signal is not saved; 
means for mixing each of the corresponding primary
and secondary media signals into a single
combined media signal; and
means for outputting each of the combined me- 
dia signals to the sources within the media con- 
ference except the primary and secondary talk- 
ers. 

19. A conference bridge according to claim 18, wherein 
the means for mixing comprises means for decom- 
pressing each of the secondary media signals prior 
to mixing it with the corresponding primary media 
signal if the secondary media signal is saved only 
in compressed form. 
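
The mixing path of claims 18 and 19 may be illustrated, purely as a sketch, as follows. The claims do not specify the codec or the nature of the generated secondary media signal; here `decompress` is a stand-in that treats the payload as signed 16-bit little-endian samples, and the generated secondary signal is assumed to be silence. All function names are hypothetical.

```python
def decompress(payload: bytes) -> list:
    # Stand-in codec (assumption): signed 16-bit little-endian samples.
    return [int.from_bytes(payload[i:i + 2], "little", signed=True)
            for i in range(0, len(payload) - 1, 2)]


def mix(a: list, b: list) -> list:
    # Sample-wise sum, clipped to the signed 16-bit range.
    return [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]


def combined_signal(primary_payload: bytes, saved_secondary) -> list:
    """Mix one primary signal with its corresponding secondary signal.

    `saved_secondary` may be None (nothing saved), bytes (saved only in
    compressed form, as in claim 19) or a list of already decompressed samples.
    """
    primary = decompress(primary_payload)
    if saved_secondary is None:
        secondary = [0] * len(primary)        # generated signal: assumed silence
    elif isinstance(saved_secondary, bytes):
        secondary = decompress(saved_secondary)
    else:
        secondary = saved_secondary
    return mix(primary, secondary)
```

The combined signal produced this way would then be output to every source other than the primary and secondary talkers.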

20. A conference bridge according to claim 13, wherein 
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond to 
the primary talker within the media conference; 

means for outputting each of the received com- 
pressed media signals that correspond to the 
primary talker to the secondary talker within the 
media conference; 

means for decompressing each of the received 
compressed media signals that correspond to 
the primary talker, resulting in primary media 
signals; 

means for determining for each primary media 
signal whether a corresponding secondary me- 
dia signal is saved; 

means for monitoring for receipt of a media data 
packet from the secondary talker for a prede- 
termined time period if a corresponding sec- 
ondary media signal is not saved; 
means for generating a corresponding second- 
ary media signal if the predetermined time pe- 
riod expires and no media data packet corre- 
sponding to the secondary talker has been re- 
ceived; 

means for mixing each of the corresponding pri- 
mary and secondary media signals into a single 
combined media signal; and 
means for outputting each of the combined me- 
dia signals to the sources within the media con- 
ference except the primary and secondary talk- 
ers. 

21. A conference bridge according to claim 13, wherein
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond 
to the primary talker within the media confer- 
ence; 

means for encapsulating and outputting each 
of the received compressed media signals that 
correspond to the primary talker to the second- 
ary talker within the media conference; 
means for determining for each of the received
compressed media signals corresponding to 
the primary talker whether a corresponding 
compressed media signal associated with the 
secondary talker is saved; 
means for generating a corresponding com- 
pressed media signal for the secondary talker 
if a corresponding compressed media signal 
associated with the secondary talker is not 
saved; 

means for encapsulating each set of the com- 
pressed media signals corresponding to the pri- 
mary and secondary talkers into a combined 
media data packet; and 

means for outputting each of the combined me- 
dia data packets to the sources within the me- 
dia conference except the primary and second- 
ary talkers. 

22. A conference bridge according to claim 13, wherein
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond to 
the primary talker within the media conference; 
means for encapsulating and outputting each of the
compressed media signals that correspond to the 
primary talker to the secondary talker within the me- 
dia conference; 

means for determining for each of the com- 
pressed media signals corresponding to the pri- 
mary talker whether a corresponding com- 
pressed media signal associated with the sec- 
ondary talker is saved; 

means for monitoring for receipt of a corre- 
sponding media data packet from the second- 
ary talker for a predetermined time period if a 
corresponding compressed media signal asso- 
ciated with the secondary talker is not saved; 
means for generating a corresponding com- 
pressed media signal for the secondary talker 
if the predetermined time period expires and no 
media data packet corresponding to the sec- 
ondary talker has been received; 
means for encapsulating each set of the com- 
pressed media signals corresponding to the pri- 
mary and secondary talkers into a combined 
media data packet; and 

means for outputting each of the combined me- 
dia data packets to the sources within the me- 
dia conference except the primary and second- 
ary talkers. 

23. A conference bridge according to any one of claims 
8 and 10 to 12, wherein the set of the sources within 
the media conference selected as talkers compris- 
es one of first, second and third sources selected 
within the media conference as primary, secondary
and tertiary talkers, first and second sources selected
within the media conference as primary and secondary
talkers respectively, one of the sources selected
within the media conference as a lone talker,
and none of the sources selected within the media
conference as a talker.

24. A conference bridge according to claim 23, wherein 
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond 
to the tertiary talker within the media confer- 
ence; and 

means for outputting each of the received compressed
media signals that correspond to the
tertiary talker to the primary and secondary 
talkers within the media conference. 

25. A conference bridge according to claim 23, wherein
the means for outputting comprises: 

means for determining whether each of the re- 
ceived compressed media signals correspond 
to the tertiary talker within the media conference;

means for decompressing each of the com- 
pressed media signals that correspond to the 
tertiary talker, resulting in tertiary media signals;
means for separately mixing each of the
tertiary media signals with corresponding primary
and secondary media signals to generate
first and second mixed media signals respec- 
tively; and 

means for outputting each of the first and second
mixed media signals to the secondary and
primary talkers respectively within the media 
conference. 

26. A method for selecting a set of talkers within a me-
dia conference, comprising: 

receiving at least one media data packet from 
at least two sources forming a media conference,
each media data packet defining a compressed
media signal;

determining at least one speech parameter cor- 
responding to each of the compressed media 
signals; and 

selecting a set of the sources within the media
conference as talkers based on the determined
speech parameters. 

27. A method according to claim 26 further comprising, 
for each of the received compressed media signals:

determining whether the compressed media 
signal corresponds to a talker within the media 
conference; and

if determined that the compressed media signal
corresponds to a talker, encapsulating the compressed
media signal and outputting the encapsulated
compressed media signal to the sources
within the media conference except the
source corresponding to the compressed media
signal.



28. A packet-based network comprising a conference
bridge and a plurality of packet-based terminals;

wherein at least two of the plurality of packet-based
terminals operate to output media data
packets comprising compressed media signals,
these packet-based terminals together
forming a media conference;
wherein the conference bridge operates to receive
the media data packets from the packet-based
terminals within the media conference;
to process the compressed media signals corresponding
to the received media data packets
including selecting a set of the packet-based
terminals within the media conference as talkers;
and to output media data packets that correspond
to the compressed media signals received
from the talkers; and
wherein at least one of the packet-based terminals
within the media conference further operates
to receive the media data packets output
from the conference bridge and to process
these received media data packets including
performing a jitter buffering operation, the jitter
buffering operations being performed within the
packet-based terminals only.

29. A network comprising a packet-based network, a
conference bridge coupled to the packet-based network,
a non-packet-based telephone network, at
least one packet-based apparatus coupled between
the packet-based network and the non-packet-based
telephone network, and a plurality of
sources for media signals that are each coupled to
the non-packet-based telephone network;

wherein the conference bridge comprises conferencing
control logic to receive at least one
media data packet from at least two of the
sources forming a media conference, each media
data packet defining a compressed media
signal; to process the received compressed
media signals including selecting a set of the
sources within the media conference as talkers;
and to output media data packets that correspond
to the compressed media signals received
from the talkers; and
wherein at least one of the packet-based apparatus
operates to receive the media data packets
output from the conference bridge and to
process these received media data packets including
performing a jitter buffering operation,
the jitter buffering operations being performed
within the packet-based apparatus only.

30. A method of processing compressed media signals
within a media conference, the method comprising:

receiving at least one compressed media packet
from at least two sources forming the media
conference, each media data packet defining a
compressed media signal;
processing the received compressed media
signals including selecting a set of the sources
within the media conference as talkers;
outputting media data packets that correspond
to compressed media signals received from the
talkers;
receiving the media data packets that correspond
to compressed media signals received
from the talkers at one or more packet-based
apparatus; and
processing the received compressed media
signals including performing a first and only
jitter buffering operation.
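
For illustration, a single jitter buffering stage of the kind that could sit in the receiving packet-based apparatus may be sketched as below. The claims do not prescribe any particular buffer design; the class name `JitterBuffer`, the use of sequence numbers for ordering and the fixed hold-back depth are all assumptions of this sketch.

```python
import heapq


class JitterBuffer:
    """Reorders packets by sequence number, holding back `depth` packets."""

    def __init__(self, depth: int = 3):
        self.depth = depth      # assumed number of packets kept in reserve
        self.heap = []          # min-heap of (sequence number, payload)

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self) -> list:
        """Release the oldest packets while more than `depth` are buffered."""
        out = []
        while len(self.heap) > self.depth:
            out.append(heapq.heappop(self.heap))
        return out
```

Because the bridge in this arrangement forwards packets without buffering them for jitter, a buffer of this kind at the receiving end would be the only point at which reordering delay is introduced.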

31. A method according to claim 30, wherein the one or
more packet-based apparatus each comprise one
of the sources forming the media conference.

32. A method according to one of claims 30 and 31,
wherein the processing the received compressed
media signals further includes a decompression
operation which outputs uncompressed media signals
corresponding to the received compressed media 
signals; and 

wherein the method further comprises forwarding
the uncompressed media signals to at least
one of the sources forming the media conference.

33. A packet-based apparatus, comprising: 

means for receiving a media data packet from 
a conference bridge, the media data packet defining
two or more compressed media signals;
means for performing initial processing of the 
received media data packet comprising remov- 
ing the packet overhead; 
means for decompressing each of the compressed
media signals in order to generate corresponding
uncompressed media signals;
means for mixing the uncompressed media sig- 
nals into a combined media signal; and 
means for outputting the combined media signal.
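
The processing chain of claim 33 (remove packet overhead, decompress each signal, mix, output) may be sketched as follows. The packet layout is hypothetical: a fixed 4-byte header followed by length-prefixed payloads. The codec is likewise a stand-in that treats each payload as signed 16-bit little-endian samples. None of these details are taken from the claims.

```python
import struct

HEADER_LEN = 4  # hypothetical fixed packet overhead, in bytes


def strip_overhead(packet: bytes) -> bytes:
    return packet[HEADER_LEN:]


def split_signals(body: bytes) -> list:
    # Assumed layout: repeated (2-byte little-endian length, payload) records.
    signals, pos = [], 0
    while pos + 2 <= len(body):
        (length,) = struct.unpack_from("<H", body, pos)
        pos += 2
        signals.append(body[pos:pos + length])
        pos += length
    return signals


def decompress(payload: bytes) -> list:
    # Stand-in codec (assumption): signed 16-bit little-endian samples.
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}h", payload[:count * 2]))


def mix(signals: list) -> list:
    # Sample-wise sum over all signals, clipped to the signed 16-bit range.
    n = min(len(s) for s in signals)
    return [max(-32768, min(32767, sum(s[i] for s in signals)))
            for i in range(n)]


def handle_packet(packet: bytes) -> list:
    payloads = split_signals(strip_overhead(packet))
    return mix([decompress(p) for p in payloads])
```

Mixing at the receiving apparatus in this way means the bridge only has to encapsulate the selected talkers' compressed signals together, without decompressing them.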

34. A packet-based apparatus, comprising: 
means for receiving a media data packet from
a conference bridge, the media data packet de- 
fining a compressed media signal; 
means for performing initial processing of the 
received media data packet comprising removing
the packet overhead;
means for decompressing the compressed me- 
dia signal in order to generate a first uncom- 
pressed media signal; 

means for identifying at least one other uncompressed
media signal that corresponds to the
first uncompressed media signal; 
means for mixing the first uncompressed media 
signal with the other uncompressed media signal
into a combined media signal; and
means for outputting the combined media sig- 
nal. 

35. A packet-based apparatus according to claim 34, 
wherein the means for identifying comprises means
for determining a first identification item within the 
packet overhead of the received media data packet 
and means for locating at least one other uncom- 
pressed media signal that corresponds to a received
media data packet comprising a second
identification item that relates to the first identifica- 
tion item. 

36. A packet-based apparatus according to claim 35, 
wherein the first and second identification items
comprise time stamps. 
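
One way the identification step of claims 34 to 36 could work is sketched below: uncompressed signals are held until a second signal carrying a related time stamp arrives, and the pair is then mixed. The sketch assumes that "relates to" means the two time stamps are equal, which the claims do not state. The class name `TimestampMixer` is hypothetical.

```python
class TimestampMixer:
    def __init__(self):
        # time stamp -> uncompressed samples still waiting for a partner
        self.pending = {}

    def on_signal(self, timestamp: int, samples: list):
        """Return the mixed samples if a partner is waiting, else None."""
        partner = self.pending.pop(timestamp, None)
        if partner is None:
            self.pending[timestamp] = samples   # hold until the partner arrives
            return None
        # Sample-wise sum, clipped to the signed 16-bit range.
        return [max(-32768, min(32767, a + b))
                for a, b in zip(samples, partner)]
```

A practical apparatus would also need a rule for releasing a held signal when no partner ever arrives; that is left out of this sketch.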

37. A packet-based apparatus according to any one of 
claims 33 to 36, wherein the means for initial
processing further comprises buffering each of the
compressed media signals for jitter after the remov- 
ing of the packet overhead from the received media 
data packet. 

38. A packet-based apparatus according to any one of
claims 33 to 36, further comprising means for buff- 
ering each of the uncompressed media signals for 
jitter prior to the signals being mixed by the means
for mixing.

39. A packet-based apparatus according to any one of 
claims 33 to 38 further comprising: 

means for receiving a second media data packet
from the conference bridge, the second media
data packet defining a single compressed
media signal; 

means for performing initial processing of the 
received second media data packet comprising 
removing the packet overhead;
means for decompressing the single compressed
media signal in order to generate a single
uncompressed media signal; and
means for outputting the single uncompressed
media signal. 

40. A packet-based apparatus according to any one of 
claims 33 to 39 further comprising a speaker cou- 
pled to the means for outputting the combined me- 
dia signal to receive the combined media signal and 
broadcast audio signals corresponding to the re- 
ceived combined media signal. 

41. A packet-based network interface comprising a 
packet-based apparatus according to any one of 
claims 33 to 39, wherein the combined media signal 
is arranged to be output, via a non-packet-based 
network, to a non-packet-based telephone terminal. 

42. A method of outputting a combined media signal 
comprising: 

receiving a media data packet from a confer- 
ence bridge, the media data packet defining 
two or more compressed media signals; 
performing initial processing of the received 
media data packet comprising removing the 
packet overhead; 

decompressing each of the compressed media 
signals in order to generate corresponding un- 
compressed media signals; 
mixing the uncompressed media signals into a 
combined media signal; and 
outputting the combined media signal. 

43. A method of outputting a combined media signal 
comprising: 

receiving a media data packet from a confer- 
ence bridge, the media data packet defining a 
compressed media signal; 
performing initial processing of the received 
media data packet comprising removing the 
packet overhead; 

decompressing the compressed media signal 
in order to generate a first uncompressed me- 
dia signal; 

identifying at least one other uncompressed 
media signal that corresponds to the first un- 
compressed media signal; 
mixing the first uncompressed media signal 
with the other uncompressed media signal into 
a combined media signal; and 
outputting the combined media signal. 

44. A conference bridge, comprising: 

means for receiving at least one first media data 
packet from at least one source within a media 
conference, each first media data packet defin- 
ing a first compressed media signal; 
means for receiving at least one second media
data packet from at least one other conference 
bridge, each second media data packet defin- 
ing at least one second compressed media signal
corresponding to a particular source within
the media conference; and 
means for selecting a set of the sources within 
the media conference as talkers based upon 
the compressed media signals within both the 
first and second media data packets.

45. A conference bridge according to claim 44,
wherein each of the second media data packets 
comprises a single compressed media signal, the 
compressed media signal corresponding to one of 
a lone talker, a primary talker and a secondary talker 
selected by the other conference bridge from which the
particular media data packet was received.

46. A conference bridge according to any one of claims 
44 and 45 further comprising: 

means for encapsulating compressed media 
signals corresponding to the selected talkers; 
means for outputting these encapsulated compressed
media signals to the source from which the
conference bridge receives the first media data
packet unless the particular source is selected
as a talker; and

means for outputting these encapsulated compressed
media signals to the at least one other
conference bridge. 

47. A conference bridge according to any one of claims
44 and 45 further comprising:

means for encapsulating compressed media 
signals corresponding to the selected talkers; 
and 

means for outputting these encapsulated compressed
media signals to the at least one other
conference bridge unless the particular com- 
pressed media signals were received from the 
at least one other conference bridge prior to receiving
the signals from another source.



48. A conference bridge according to any one of claims 44 to 47,
wherein the means for selecting comprises means
for determining at least one speech parameter corresponding
to each of the first and second compressed
media signals and means for selecting a
set of the sources within the media conference as
talkers based on the determined speech parameters.

49. A method for selecting a set of talkers within a media
conference, comprising:

receiving at least one first media data packet
from at least one source within a media conference,
each first media data packet defining a
first compressed media signal;
receiving at least one second media data packet
from at least one other conference bridge,
each second media data packet defining at
least one second compressed media signal
corresponding to a particular source within the
media conference; and
selecting a set of the sources within the media
conference as talkers based upon the compressed
media signals within both the first and
second media data packets.

50. A packet-based apparatus, comprising:

a receiver capable of being coupled to a network
to receive at least one first media data
packet from a first source within a media conference,
each first media data packet defining
a first compressed media signal; receive at
least one second media data packet from a second
source within the media conference, each
second media data packet defining at least one
second compressed media signal; and perform
initial processing of the received first and second
media data packet comprising removing
the packet overhead; and
an output unit coupled to the receiver to decompress
each of the first and second compressed
media signals in order to generate corresponding
first and second uncompressed media signals,
mix the first and second uncompressed
media signals into a combined media signal,
and output the combined media signal.